Series Editor
Patrick Paroubek

Natural Language Processing and Computational Linguistics 1

Speech, Morphology and Syntax

Mohamed Zakaria Kurdi

Introduction

Language is one of the central tools in our social and professional life. Among other things, it acts as a medium for transmitting ideas, information, opinions and feelings, as well as for persuading, asking for information, giving orders, etc. Computer Science took an interest in language as soon as the field itself emerged, notably within Artificial Intelligence (AI). The Turing test, one of the first tests developed to judge whether a machine is intelligent or not, stipulates that to be considered intelligent, a machine must possess conversational abilities that are comparable to those of a human being [TUR 50]. This implies that an intelligent machine must possess comprehension and production abilities, in the broadest sense of these terms. Historically, natural language processing (NLP) turned very quickly to real-world applications, particularly machine translation (MT) during the Cold War. This began with the first machine translation system, which was developed in a joint project between Georgetown University and IBM in the United States [DOS 55, HUT 04]. This work was not crowned with the success that was expected, as the researchers soon realized that a deep understanding of the linguistic system is a prerequisite for any comprehensive application of this kind. This discovery, presented in the famous report by the Automatic Language Processing Advisory Committee (ALPAC), had a considerable impact upon machine translation work and on the field of NLP in general. Today, even though NLP is largely industrialized, the interest in basic language processing has not waned. In fact, whatever the application of modern NLP, the use of a basic language processing unit such as a morphological analyzer, a syntactic parser, or a speech recognition or speech synthesis module is almost always indispensable (see [JON 11] for a more complete review of the history of NLP).

I.1. The definition of NLP

Firstly, what is NLP? It is a discipline found at the intersection of several other branches of science, such as Computer Science, Artificial Intelligence and Cognitive Psychology. In English, there are several terms for fields which are very close to one another. Even though the boundaries between these fields are not always very clear, we will try to give a definition of each, without claiming that these definitions are unanimously accepted in the community. For example, the terms formal linguistics or computational linguistics relate more to the linguistic models or formalisms developed with a view to computer implementation. The terms Human Language Technology or Natural Language Processing, on the other hand, refer more to software tools equipped with features related to language processing. Furthermore, speech processing designates a range of techniques, from signal processing to the recognition or production of linguistic units such as phonemes, syllables or words. Except for the dimension dealing with signal processing, there is no major difference between speech processing and NLP. Many techniques that were initially applied to speech processing have found their way into NLP applications, Hidden Markov Models (HMMs) being one example. This encouraged us to follow, in this book, the unifying path already taken by other colleagues, such as [JUR 00], which involves grouping NLP and speech processing into the same discipline. Finally, it is worth mentioning the term corpus linguistics, which refers to the methods of collection, annotation and use of corpora, both in linguistic research and in NLP. Since corpora play a very important role in the construction of NLP systems, notably those which adopt a machine learning approach, we saw fit to consider corpus linguistics as a branch of NLP.

In the following sections, we will present and discuss the relationships between NLP and related disciplines such as linguistics, AI, cognitive science and data science.

I.1.1. NLP and linguistics

Today, with the democratization of NLP, its tools have become part of the toolkit of many linguists who conduct empirical, corpus-based work. Part-of-speech (POS) taggers, morphological analyzers and syntactic parsers of different types are therefore often used in quantitative studies.

They may also be used to provide the necessary data for a psycholinguistics experiment. Furthermore, NLP offers linguists and cognitive scientists a new perspective by adding a new dimension to research carried out within these fields. This new dimension is testability. Indeed, many theoretical models have been tested empirically with the help of NLP applications.

I.1.2. NLP and AI

AI is the study, design and creation of intelligent agents. An intelligent agent is a natural or artificial system with perceptual abilities that allow it to act in a given environment to satisfy its desires or successfully achieve planned objectives (see [MAR 14a] and [RUS 10] for a general introduction). Work in AI is generally classified into several sub-disciplines or branches, such as knowledge representation, planning, perception and learning. All these branches are directly related to NLP, which gives the relationship between AI and NLP a very important dimension. Many consider NLP to be a branch of AI, while some prefer to consider it a more independent discipline.

In the field of AI, planning involves finding the steps to follow to achieve a given goal. This is achieved based on a description of the initial states and possible actions. In the case of an NLP system, planning is necessary to perform complex tasks involving several sources of knowledge that must cooperate to achieve the final goal.
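As a purely illustrative sketch of this idea, the following Python fragment searches breadth-first for a sequence of actions leading from an initial state to a goal. The `Action` structure, the `plan` function and the three toy actions (`recognize_speech`, `parse`, `interpret`) are assumptions introduced here for the example; they do not describe any particular planner or NLP system.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset  # facts that must hold before the action
    additions: frozenset      # facts made true by the action
    deletions: frozenset      # facts made false by the action


def plan(initial, goal, actions):
    """Breadth-first search for a shortest action sequence reaching the goal.

    Returns a list of action names, or None if no plan exists.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                successor = (state - action.deletions) | action.additions
                if successor not in visited:
                    visited.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None


# Hypothetical actions for a toy spoken-language pipeline.
ACTIONS = [
    Action("recognize_speech", frozenset({"audio"}), frozenset({"text"}), frozenset()),
    Action("parse", frozenset({"text"}), frozenset({"syntax_tree"}), frozenset()),
    Action("interpret", frozenset({"syntax_tree"}), frozenset({"meaning"}), frozenset()),
]

print(plan({"audio"}, {"meaning"}, ACTIONS))
```

Here each state is a set of facts, and each action is described by what it requires and what it changes. Breadth-first search is only one simple choice among many possible search strategies.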

Knowledge representation is important for an NLP system at two levels. On the one hand, it can provide a framework to represent the linguistic knowledge necessary for the smooth functioning of the whole NLP system, even if the size and the quantity of the declarative pieces of information in the system vary considerably according to the approach chosen. On the other hand, some NLP systems require extralinguistic information to make decisions, especially in ambiguous cases. Therefore, certain NLP systems are paired with ontologies or with knowledge bases in the form of semantic networks, frames or conceptual graphs.

In theory, perception and language seem far from one another, but in reality this is not the case, especially for spoken language, where the linguistic message is conveyed by sound waves produced by the vocal folds. Making the connection between perception and speech recognition (the equivalent of perception with a comprehension element) is crucial, not only for comprehension but also for improving the quality of recognition. Furthermore, some current research projects are looking at the connection between the perception of spoken language and the perception of visual information.

Machine learning involves building a representation after having examined data which may or may not have previously been analyzed. Since the 2000s, machine learning has gained particular attention within the field of AI, thanks to the opportunities it offers: it allows intelligent systems to be built with minimal effort compared to rule-based symbolic systems, which require more work from human experts. In the field of NLP, the extent to which machine learning is used depends highly on the targeted linguistic level. It ranges from almost total domination in speech recognition systems to limited use in high-level processing such as discourse analysis and pragmatics, where the symbolic paradigm is still dominant.

I.1.3. NLP and cognitive science

As with linguistics, the relationship between cognitive science and NLP goes in two directions. On the one hand, cognitive models can serve as a source of inspiration for an NLP system. On the other hand, constructing an NLP system according to a cognitive model can be a way of testing this model. The practical benefit of an approach which mimics the cognitive process remains an open question because, in many fields, constructing a system inspired by biological models does not prove to be productive. It should also be noted that certain tasks carried out by NLP systems have no parallel in humans, such as retrieving information through search engines or searching through large volumes of text data to extract useful information. In such cases, NLP can be seen as an extension of human cognitive capabilities, as part of a decision support system, for example. Other NLP systems are very close to human tasks, such as comprehension and production.

I.1.4. NLP and data science

With the availability of more and more digital data, a new discipline has recently emerged: data science. It involves extracting, quantifying and visualizing knowledge, primarily from textual and spoken data. Since these data are found in natural language in many cases, the role of NLP in the extraction and treatment process is obvious. Currently, given the countless industrial uses for this kind of knowledge, especially within the fields of marketing and decision-making, data science has become extremely important, even reminiscent of the beginning of the Internet in the 1990s. This shows that NLP is as useful when applied as it is when considered as a research field.

I.2. The structure of this book

The aim of this book is to give a panoramic overview of both early and modern research in the field of NLP. It aims to give a unified vision of fields which are often considered as being separate, for example speech processing, computational linguistics, NLP and knowledge engineering. It aims to be profoundly interdisciplinary and tries to consider the various linguistic and cognitive models as well as the algorithms and computational applications on an equal footing. The main postulate adopted in this book is that the best results can only be the outcome of a solid theoretical backbone and a well thought-out empirical approach. Of course, we are not claiming that this book covers the entirety of the work that has been done, but we have tried to strike a balance between North American, European and international work. Our approach is thus based on a dual perspective: on the one hand, it aims to be accessible and informative; on the other, it presents the state of the art of a mature field which is in a constant state of evolution.

As a result, this work uses an approach that consists of making linguistic and computer science concepts accessible by using carefully chosen examples. Furthermore, even though this book seeks to give the maximum amount of detail possible about the approaches presented, it nevertheless remains neutral about implementation details to leave each individual some freedom regarding the choice of a programming language. This must be chosen according to personal preference as well as the specific objective needs of individual projects.

Besides the introduction, this book is made up of four chapters. The first chapter looks at the linguistic resources used in NLP. It presents the different types of corpora that exist, their collection, as well as their methods of annotation. The second chapter discusses speech and speech processing. Firstly, we will present the fundamental concepts in phonetics and phonology and then we will move to the two most important applications in the field of speech processing: recognition and synthesis. The third chapter looks at the word level and it focuses particularly on morphological analysis. Finally, the fourth chapter covers the field of syntax. The fundamental concepts and the most important syntactic theories are presented, as well as the different approaches to syntactic analysis.