Table of Contents

Cover

Title

Introduction

I.1. The definition of NLP
I.2. The structure of this book

1 Linguistic Resources for NLP

1.1. The concept of a corpus
1.2. Corpus taxonomy
1.3. Who collects and distributes corpora?
1.4. The lifecycle of a corpus
1.5. Examples of existing corpora

2 The Sphere of Speech

2.1. Linguistic studies of speech
2.2. Speech processing

3 Morphology Sphere

3.1. Elements of morphology
3.2. Automatic morphological analysis

4 Syntax Sphere

4.1. Basic syntactic concepts
4.2. Elements of formal syntax
4.3. Syntactic formalisms
4.4. Automatic parsing

Bibliography

Index

End User License Agreement

List of Illustrations

1 Linguistic Resources for NLP

Figure 1.1. Extract from a parallel corpus [MCE 96]
Figure 1.2. Lifecycle of a corpus
Figure 1.3. Data collection system using the Wizard of Oz method
Figure 1.4. Diagram of a corpus data collection system using a prototype
Figure 1.5. Transcription example using the software Transcriber
Figure 1.6. Segment of a corpus analyzed using parts of speech
Figure 1.7. Extract from the Penn Treebank
Figure 1.8. Extract from a tree corpus for French
Figure 1.9. Semantic annotation with a has_target relationship

2 The Sphere of Speech

Figure 2.1. Communication system
Figure 2.2. Speech organs
Figure 2.3. Position of the soft palate during the production of French vowels
Figure 2.4. Parts and aperture of the tongue
Figure 2.5. Degree of aperture
Figure 2.6. Displacement of air molecules by the vibrations of a tuning fork
Figure 2.7. Frequency and amplitude of a simple wave
Figure 2.8. An aperiodic wave
Figure 2.9. Analysis of a complex wave
Figure 2.10. A collection of tuning forks plays the role of a spectrograph
Figure 2.11. Spectrogram of a French speaker saying “la rose est rouge” generated using the Praat software
Figure 2.12. Spectrograms of the French vowels: [a], [i] and [u]
Figure 2.13. Spectrograms of several non-sense words with consonants in the center
Figure 2.14. Physiology of the ear
Figure 2.15. Lip rounding
Figure 2.16. Front vowels and back vowels
Figure 2.17. French vowel trapezium
Figure 2.18. Nasal and oral consonants
Figure 2.19. Examples of some possible syllabic structures in French
Figure 2.20. Examples of how double consonants are dealt with by the timing tier
Figure 2.21. Propagation of nasality in Warao
Figure 2.22. General architecture of speech recognition systems
Figure 2.23. Markovian model of Xavier’s moods
Figure 2.24. HMM diagram of Xavier’s behavior and his moods
Figure 2.25. Markov chain for the word “ouvre” (open)
Figure 2.26. Markov chain for the recognition of vocal commands
Figure 2.27. HMM for the word “ouvre” (open)
Figure 2.28. Trellis with three possible paths
Figure 2.29. Typical architecture of an SS system
Figure 2.30. General architecture of a concatenation synthesis system
Figure 2.31. Serial and parallel architecture of formant speech synthesis systems

3 Morphology Sphere

Figure 3.1. FSM for expressions of encouragement
Figure 3.2. Examples of regular expressions with their FSM equivalence
Figure 3.3. Conjugation of the verbs poser and porter in the present indicative tense
Figure 3.4. Correspondence pair for the word houses
Figure 3.5. FST for some words in French with the prefix “anti–”
Figure 3.6. Partial FST for the derivation of some French words
Figure 3.7. Kay and Kaplan diagram
Figure 3.8. Xerox approach to the use of FST in morphological analysis
Figure 3.9. A micro-text tagged with POS
Figure 3.10. Tag sequences for “The written history of the Gauls is known”
Figure 3.11. Architecture of the Brill tagger [BRI 95]
Figure 3.12. Example of transformation-based learning

4 Syntax Sphere

Figure 4.1. The role of grammar according to Chomsky
Figure 4.2. Relationships in the framework of the WG formalism [HUD 10]
Figure 4.3. Analysis of a simple sentence by the formalism of WG
Figure 4.4. Example of an analysis by chunks [ABN 91a]
Figure 4.5. Example of attachment ambiguity of a prepositional phrase
Figure 4.6. Syntax trees of some noun phrases
Figure 4.7. Grammar for the structures as shown in Figure 4.6
Figure 4.8. Syntax trees and rewrite rules of an adjective phrase
Figure 4.9. Grammar for the structures presented in Figure 4.8
Figure 4.10. Grammar for the noun phrase with a recursion
Figure 4.11. Examples of VP with different complement types
Figure 4.12. Analysis of two types of sentences with two types of complements
Figure 4.13. Example of analysis of two relative sentences
Figure 4.14. Examples of the coordination of two phrases and two sentences
Figure 4.15. Two syntax trees for a syntactically ambiguous sentence
Figure 4.16. Hierarchy of formal grammars
Figure 4.17. Grammar for the language aⁿ bⁿ cⁿ
Figure 4.18. Syntax tree for the strings: abc and aabbcc
Figure 4.19. The derivation of strings: ab, aabb, aaabbb
Figure 4.20. Example of a grammar in Chomsky normal form with examples of syntax trees
Figure 4.21. Syntax tree of an NP in Chomsky normal form
Figure 4.22. Example of grammar in Greibach normal form
Figure 4.23. Regular grammar that generates the language aⁿb^m
Figure 4.24. Types of branching in complex sentences
Figure 4.25. Type-2 grammar modified to account for the agreement
Figure 4.26. Feature structures of the noun “house” and of the verb “love”
Figure 4.27. CFS of a simple sentence
Figure 4.28. Feature graphs for the agreement feature for the words “house” and “love”
Figure 4.29. Example of structures of shared value and of a reentrant structure
Figure 4.30. Example of structures of shared value and of a reentrant structure
Figure 4.31. Examples of feature structures with subsumption relationships
Figure 4.32. Examples of unifications
Figure 4.33. DCG Grammar
Figure 4.34. DCG enriched with FS
Figure 4.35. Rewrite rule and syntax tree of a complex noun phrase
Figure 4.36. Examples of phrases with their heads
Figure 4.37. Diagrams of the two basic rules
Figure 4.38. Examples of noun phrases
Figure 4.39. Diagram and example of a determiner phrase according to [ABN 87]
Figure 4.40. Example of the processing of a verb phrase with the X-bar theory
Figure 4.41. Diagram and example of analysis of entire sentences
Figure 4.42. Analysis of a completive subordinate
Figure 4.43. Diagram of a typed FS in HPSG
Figure 4.44. Simplified lexical entry of “house”
Figure 4.45. Some abbreviations of FS in HPSG
Figure 4.46. Enriched FS of the words “house” and “John”
Figure 4.47. Some simplified FS of verbs
Figure 4.48. FS of the verb “sees”
Figure 4.49. General diagram of l-rules
Figure 4.50. Rule of plural
Figure 4.51. Rule of derivation of an agent noun from the verb
Figure 4.52. Head-Complement Rule
Figure 4.53. Head-Complement Rule applied to a transitive verb
Figure 4.54. Head-Modifier Rule
Figure 4.55. Head-Specifier Rule
Figure 4.56. Lexical entry of the determiner “the”
Figure 4.57. Feature structures of the noun phrase: the house
Figure 4.58. Analysis of the verb phrase: sees the house
Figure 4.59. The FS of the pronoun “he”
Figure 4.60. The analysis of the sentence: he sees the house
Figure 4.61. Examples of initial and auxiliary elementary trees
Figure 4.62. Diagram and example of substitution in LTAG
Figure 4.63. General diagram and example of adjunction
Figure 4.64. An example of a derived tree and a corresponding derivation tree
Figure 4.65. Examples of feature structures associated with elementary trees
Figure 4.66. An example of a substitution with unification
Figure 4.67. Diagram of an addition with unification
Figure 4.68. Example of a recursive transition network
Figure 4.69. A DCG and the corresponding RTNs
Figure 4.70. Context-free grammars for the parsing of a fragment
Figure 4.71. Example of parsing with a top-down algorithm
Figure 4.72. Basic top-down algorithms
Figure 4.73. Micro-grammar with a left recursion
Figure 4.74. Left recursion with a top-down algorithm
Figure 4.75. Example of parsing with a bottom-up algorithm
Figure 4.76. Basic top-down algorithms
Figure 4.77. CFG Grammar
Figure 4.78. Repeated backtracking with a top-down algorithm
Figure 4.79. Left-corner algorithm
Figure 4.80. Example of parsing with the left-corner algorithm
Figure 4.81. Table of an incomplete parsing
Figure 4.82. Table of a complete parsing of a sentence
Figure 4.83. Partial active chart
Figure 4.84. Diagram of the first fundamental rule
Figure 4.85. Example of application of the fundamental rule
Figure 4.86. Tabular parsing algorithm with a bottom-up approach
Figure 4.87. Example of a probabilistic context-free grammar for a fragment of French
Figure 4.88. Parse tree for a sentence from the PCFG of Figure 4.87
Figure 4.89. Supervised learning of a PCFG
Figure 4.90. General structure of the parse table of the CYK algorithm
Figure 4.91. The first step in the execution of the CYK algorithm
Figure 4.92. The second step in the execution of the CYK algorithm
Figure 4.93. The third step in the execution of the CYK algorithm
Figure 4.94. The fourth step in the execution of the CYK algorithm
Figure 4.95. Architecture of a neural network for handwritten digit recognition [NIE 14]
Figure 4.96. Example of a recurrent network

List of Tables

2 The Sphere of Speech

Table 2.1. Examples of IPA transcriptions from French and English
Table 2.2. The first three formants of the vowels [a], [i] and [u]
Table 2.3. Examples of rounded and unrounded vowels in French
Table 2.4. Nasal vowels in French
Table 2.5. Oral vowels in French
Table 2.6. Places of articulation of French consonants
Table 2.7. French semi-vowels
Table 2.8. Examples of distinctive features according to the taxonomy by Chomsky and Halle [CHO 68]
Table 2.9. Constraint forbidding three successive consonants in Egyptian Arabic
Table 2.10. Constraints involved in the case of joining (liaison) in French
Table 2.11. Classification parameters of speech recognition systems
Table 2.12. Probabilities of Xavier’s moods tomorrow, with the knowledge of his mood today
Table 2.13. Probability of Xavier’s behavior, knowing his mood
Table 2.14. Micro-corpus unigrams
Table 2.15. Bigrams in the micro-corpus with their frequencies
Table 2.16. Abbreviations to be normalized before synthesis
Table 2.17. Examples of transcriptions with the Arpabet format

3 Morphology Sphere

Table 3.1. Examples of Arabic words derived from the stem k-t-b
Table 3.2. Examples of words in Turkish
Table 3.3. Examples of prefixes commonly used in English
Table 3.4. Examples of suffixes commonly used in English
Table 3.5. Examples of collocations in three French literary corpora [LEG 12]
Table 3.6. Examples of colligation
Table 3.7. Successors of the word read [FRA 92]
Table 3.8. Bigrams of the words bonbon and bonbonne
Table 3.9. Some regular expressions with simple sequences
Table 3.10. Regular expressions with character categories
Table 3.11. Priority of operators in regular expressions
Table 3.12. FSM transition table for expressions of encouragement
Table 3.13. A minimal list of tags

4 Syntax Sphere

Table 4.1. Clefting patterns
Table 4.2. Examples of restrictive negation
Table 4.3. A few examples of word order variation in spoken language
Table 4.4. Examples of noun phrases and their morphological sequences
Table 4.5. Summary of formal grammars
Table 4.6. Adopted notation and variants in the literature
Table 4.7. Types in HPSG formalism [POL 97]
Table 4.8. Labels adopted for the annotation of RTN
Table 4.10. Table of left-corners of the grammar of Figure 4.70
Table 4.11. Summary of spaces required by the three parsing approaches [RES 92a]

Introduction

Language is one of the central tools of our social and professional life. Among other things, it serves as a medium for transmitting ideas, information, opinions and feelings, as well as for persuading, asking for information, giving orders, etc. Computer Science took an interest in language as soon as the field itself emerged, notably within Artificial Intelligence (AI). The Turing test, one of the first tests developed to judge whether a machine is intelligent, stipulates that to be considered intelligent, a machine must possess conversational abilities comparable to those of a human being [TUR 50]. This implies that an intelligent machine must possess comprehension and production abilities, in the broadest sense of these terms. Historically, natural language processing (NLP) turned very quickly to real-world applications, particularly machine translation (MT) during the Cold War. This began with the first machine translation system, the product of a joint project between Georgetown University and IBM in the United States [DOS 55, HUT 04]. This work was not crowned with the expected success, as researchers soon realized that a deep understanding of the linguistic system is a prerequisite for any comprehensive application of this kind. This finding, presented in the famous report by the Automatic Language Processing Advisory Committee (ALPAC), had a considerable impact on machine translation work and on the field of NLP in general. Today, even though NLP is largely industrialized, interest in basic language processing has not waned. In fact, whatever the application of modern NLP, a basic language processing unit such as a morphological analyzer, a syntactic parser, or a speech recognition or synthesis module is almost always indispensable (see [JON 11] for a more complete review of the history of NLP).

I.1. The definition of NLP

Firstly, what is NLP? It is a discipline found at the intersection of several other fields, such as Computer Science, Artificial Intelligence and Cognitive Psychology. In English, several terms designate fields that are very close to one another. Even though the boundaries between these fields are not always clear, we will try to define each of them, without claiming that these definitions are unanimously accepted in the community. For example, the terms formal linguistics and computational linguistics relate more to linguistic models or formalisms developed with a view to computer implementation. The terms Human Language Technology and Natural Language Processing, on the other hand, refer more to the development of software tools equipped with language processing features. Furthermore, speech processing designates a range of techniques from signal processing to the recognition or production of linguistic units such as phonemes, syllables or words. Except for the signal processing dimension, there is no major difference between speech processing and NLP. Many techniques that were initially applied to speech processing have found their way into NLP applications, Hidden Markov Models (HMM) being one example. This encouraged us to follow, in this book, the unifying path already taken by other colleagues, such as [JUR 00], which consists of grouping NLP and speech processing into the same discipline. Finally, it is worth mentioning the term corpus linguistics, which refers to the methods of collection, annotation and use of corpora, both in linguistic research and in NLP. Since corpora play a very important role in the construction of NLP systems, notably those that adopt a machine learning approach, we saw fit to consider corpus linguistics as a branch of NLP.

In the following sections, we will present and discuss the relationships between NLP and related disciplines such as linguistics, AI and cognitive science.

I.1.1. NLP and linguistics

Today, with the democratization of NLP tools, these tools have become part of the toolkit of many linguists conducting empirical work on corpora. Part-Of-Speech (POS) taggers, morphological analyzers and syntactic parsers of different types are thus often used in quantitative studies.

They may also be used to provide the necessary data for a psycholinguistics experiment. Furthermore, NLP offers linguists and cognitive scientists a new perspective by adding a new dimension to research carried out within these fields. This new dimension is testability. Indeed, many theoretical models have been tested empirically with the help of NLP applications.

I.1.2. NLP and AI

AI is the study, design and creation of intelligent agents. An intelligent agent is a natural or artificial system with perceptual abilities that allow it to act in a given environment to satisfy its desires or successfully achieve planned objectives (see [MAR 14a] and [RUS 10] for a general introduction). Work in AI is generally classified into several sub-disciplines or branches, such as knowledge representation, planning, perception and learning. All these branches are directly related to NLP, which gives the relationship between AI and NLP a very important dimension. Many consider NLP to be a branch of AI, while some prefer to consider it a more independent discipline.

In the field of AI, planning involves finding the steps to follow to achieve a given goal. This is achieved based on a description of the initial states and possible actions. In the case of an NLP system, planning is necessary to perform complex tasks involving several sources of knowledge that must cooperate to achieve the final goal.

Knowledge representation is important for an NLP system at two levels. On the one hand, it can provide a framework to represent the linguistic knowledge necessary for the smooth functioning of the whole NLP system, even if the size and the quantity of the declarative pieces of information in the system vary considerably according to the approach chosen. On the other hand, some NLP systems require extralinguistic information to make decisions, especially in ambiguous cases. Therefore, certain NLP systems are paired with ontologies or with knowledge bases in the form of a semantic network, a frame or conceptual graphs.

In theory, perception and language seem far from one another, but in reality, this is not the case, especially when we are talking about spoken language where the linguistic message is conveyed by sound waves produced by the vocal folds. Making the connection between perception and voice recognition (the equivalent of perception with a comprehension element) is crucial, not only for comprehension, but also to improve the quality of speech recognition. Furthermore, some current research projects are looking at the connection between the perception of spoken language and the perception of visual information.

Machine learning involves building a representation after having examined data which may or may not have previously been analyzed. Since the 2000s, machine learning has gained particular attention within the field of AI, because it allows intelligent systems to be built with far less effort than rule-based symbolic systems, which require more work from human experts. In the field of NLP, the extent to which machine learning is used depends heavily on the targeted linguistic level: it ranges from almost total domination in speech recognition systems to limited usage in high-level processing such as discourse analysis and pragmatics, where the symbolic paradigm is still dominant.

I.1.3. NLP and cognitive science

As with linguistics, the relationship between cognitive science and NLP goes in two directions. On the one hand, cognitive models can serve as a source of inspiration for an NLP system. On the other hand, constructing an NLP system according to a cognitive model can be a way of testing this model. The practical benefit of an approach which mimics the cognitive process remains an open question because, in many fields, constructing a system inspired by biological models does not prove to be productive. It should also be noted that certain tasks carried out by NLP systems have no parallel in humans, such as searching for information with search engines or searching through large volumes of text data to extract useful information. In such cases, NLP can be seen as an extension of human cognitive capabilities, for example as part of a decision support system. Other NLP systems are very close to human tasks, such as comprehension and production.

I.1.4. NLP and data science

With the availability of more and more digital data, a new discipline has recently emerged: data science. It involves extracting, quantifying and visualizing knowledge, primarily from textual and spoken data. Since these data are in many cases expressed in natural language, the role of NLP in the extraction and processing of this knowledge is obvious. Currently, given the countless industrial uses for this kind of knowledge, especially within the fields of marketing and decision-making, data science has become extremely important, and its rise is even reminiscent of the early days of the Internet in the 1990s. This shows that NLP is as useful when applied as it is when considered as a research field.

I.2. The structure of this book

The aim of this book is to give a panoramic overview of both early and modern research in the field of NLP. It aims to give a unified vision of fields which are often considered as being separate, for example speech processing, computational linguistics, NLP and knowledge engineering. It aims to be profoundly interdisciplinary and tries to consider the various linguistic and cognitive models as well as the algorithms and computational applications on an equal footing. The main postulate adopted in this book is that the best results can only be the outcome of a solid theoretical backbone and a well thought-out empirical approach. Of course, we are not claiming that this book covers the entirety of the work that has been done, but we have tried to strike a balance between North American, European and international work. Our approach is thus based on a dual perspective: aiming to be accessible and informative on the one hand, and, on the other, presenting the state of the art of a mature field which is in a constant state of evolution.

As a result, this work uses an approach that consists of making linguistic and computer science concepts accessible by using carefully chosen examples. Furthermore, even though this book seeks to give the maximum amount of detail possible about the approaches presented, it nevertheless remains neutral about implementation details to leave each individual some freedom regarding the choice of a programming language. This must be chosen according to personal preference as well as the specific objective needs of individual projects.

Besides the introduction, this book is made up of four chapters. The first chapter looks at the linguistic resources used in NLP. It presents the different types of corpora that exist, their collection, as well as their methods of annotation. The second chapter discusses speech and speech processing. Firstly, we will present the fundamental concepts in phonetics and phonology and then we will move to the two most important applications in the field of speech processing: recognition and synthesis. The third chapter looks at the word level and it focuses particularly on morphological analysis. Finally, the fourth chapter covers the field of syntax. The fundamental concepts and the most important syntactic theories are presented, as well as the different approaches to syntactic analysis.

Natural Language Processing and Computational Linguistics 1

Speech, Morphology and Syntax
