Table of Contents

Cover

Title

Introduction

1 The Sphere of Lexicons and Knowledge

1.1. Lexical semantics
1.2. Lexical databases
1.3. Knowledge representation and ontologies

2 The Sphere of Semantics

2.1. Combinatorial semantics
2.2. Formal semantics

3 The Sphere of Discourse and Text

3.1. Discourse analysis and pragmatics
3.2. Computational approaches to discourse

4 The Sphere of Applications

4.1. Software engineering for NLP software
4.2. Machine translation (MT)
4.3. Information retrieval (IR)
4.4. Big Data (BD) and information extraction

Conclusion

Bibliography

Index

End User License Agreement

List of Tables

1 The Sphere of Lexicons and Knowledge

Table 1.1. Examples of the connotations of the color red
Table 1.2. The metaphor of life as a voyage
Table 1.3. The metaphor of humans as machines and machines as humans
Table 1.4. Examples of polysemy
Table 1.5. Examples of partial synonyms
Table 1.6. A semic analysis of the field of the chair according to Pottier
Table 1.7. Comparison of the attributes of sparrows, ostriches and kiwis
Table 1.8. Simplified frame of a plane

2 The Sphere of Semantics

Table 2.1. Analyses of semias: train, metro, bus and coach
Table 2.2. Truth table for the negation operator
Table 2.3. Truth values for the conjunction operator
Table 2.4. Truth table for the or operator
Table 2.5. Truth table for the exclusive or operator
Table 2.6. Truth table for the implication operator
Table 2.7. Truth table for the biconditional
Table 2.8. Truth table for formula α
Table 2.9. Proof of P ⋀ P ≡ P
Table 2.10. Proof of P ⋀ Q ≡ Q ⋀ P
Table 2.11. Proof of (P → (Q ∨ R)) ≡ ((P → Q) ∨ (P → R))

3 The Sphere of Discourse and Text

Table 3.1. Some perspectives on the difference between text and discourse
Table 3.2. Examples of sentences with their presuppositions
Table 3.3. Inventory of mononuclear relations established in [MAN 88]
Table 3.4. Inventory of multinuclear relations established in [MAN 88]
Table 3.5. The constraints on the relation
Table 3.6. The four types of transition possible in a discourse segment

4 The Sphere of Applications

Table 4.1. Comparison of several programming environments in Prolog [FER 00]
Table 4.2. List of the most frequent words in the CSFN
Table 4.3. Comparison of tf, cf, tf/cf, idf and tf-idf scores
Table 4.4. Term–document matrix
Table 4.5. Document–document matrix for the collection in Figure 4.27
Table 4.6. Standardized document–document matrix (binary)
Table 4.7. Term–document matrix for our collection and the two queries
Table 4.8. The distances between the queries and the documents in this collection
Table 4.9. Hearst’s extraction patterns for hyponyms
Table 4.10. The basic emotions and corresponding neurotransmitter rates
Table 4.11. Examples of opinions with different types of qualifications
Table 4.12. Categories of terms in WordNetAffect [STR 04]
Table 4.13. Relation between syntactic structures and the elements of an opinion

List of Illustrations

1 The Sphere of Lexicons and Knowledge

Figure 1.1. General diagram of lexical hierarchies in a field [CRU 00]
Figure 1.2. Partial taxonomy of animals
Figure 1.3. Meronymic hierarchy of the human body
Figure 1.4. General structure of a semiotic square
Figure 1.5. Example of a semiotic square for feminine/masculine
Figure 1.6. Lexical structure of the lexical entry car
Figure 1.7. Extract of an SGML document that represents the days of the week
Figure 1.8. Diagram of a possible use of SGML in a real context
Figure 1.9. Structure of a lexical entry
Figure 1.10. Example of the definition of a lexical entry in the form of an SGML document
Figure 1.11. Possible display for the SGML document
Figure 1.12. Example of an XML document that represents the days of the week
Figure 1.13. Use of XML to format lexical entries
Figure 1.14. An example of RDF metadata
Figure 1.15. Example of the entry dresser [BUR 15]
Figure 1.16. Example of a MARTIF format document
Figure 1.17. The components of the LMF standard
Figure 1.18. Class diagram of the LMF’s core
Figure 1.19. Example of a lexicon coded with LMF [FRA 06]
Figure 1.20. Papillon macrostructure with interlanguage links [MAN 06]
Figure 1.21. Examples of interlanguage links [BOI 02]
Figure 1.22. Example of an interlanguage sense in XML [SÉR 01]
Figure 1.23. Microstructure of the lexia murder [MAN 06]
Figure 1.24. Data flow in the DEB system [SMR 03]
Figure 1.25. Example of a lexical entry in the wn_cz dictionary [SMR 03]
Figure 1.26. Example of a lexical entry in the gloss_en dictionary [SMR 03]
Figure 1.27. Result of a search for the word car in WordNet
Figure 1.28. The pivot, prolexeme and instances of the UNO [MAU 08]. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 1.29. A program in Prolog with a micro knowledge base
Figure 1.30. Example of a simple semantic network
Figure 1.31. Example of a semantic network with two multiple inheritances
Figure 1.32. A conceptual graph that represents: the book is on the table
Figure 1.33. The conceptual graph for the sentence: all books are in paper
Figure 1.34. Conceptual graph for John goes to Prague by plane tomorrow
Figure 1.35. Dependencies between the components of the sentence Mary thinks that John wants to go to Prague by plane tomorrow
Figure 1.36. Conceptual graph of the sentence Mary thinks that John wants to go to Prague by plane tomorrow
Figure 1.37. Linear form of the graph in Figure 1.36
Figure 1.38. The class laborer and two instances
Figure 1.39. UNL graph of sentence 1.6
Figure 1.40. Global architecture of an information system with an ontology and an NLP module
Figure 1.41. Role of an ontology in an information exchange process [MAE 03]
Figure 1.42. The levels of knowledge in an ontology (adapted from [RIG 99])
Figure 1.43. Taxonomy of DOLCE [MAS 03]
Figure 1.44. Base structure of the SUMO ontology
Figure 1.45. Examples of representation with the language KIF-SUMO
Figure 1.46. Hierarchy of SNAP entities
Figure 1.47. Hierarchy of SPAN entities

2 The Sphere of Semantics

Figure 2.1. Description of the word pig according to the model in [FOD 64]
Figure 2.2. General diagram of the initial model of generative semantics
Figure 2.3. Semantic representation of the deep structure for: John killed Mary
Figure 2.4. Result of a predicate raising operation
Figure 2.5. Typical structure of a sentence according to Fillmore’s model
Figure 2.6. Deep structure of the sentence John gave flowers to Mary
Figure 2.7. Representations of the sentence: John repaired the television with a screwdriver
Figure 2.8. Interactions of the independent components of macrosemantics
Figure 2.9. Architecture of the meaning-text model [MEL 97]
Figure 2.10. Semantic network of the predicate p(x, y) [MEL 97]
Figure 2.11. Lexical semantic rule R1 [MEL 97]
Figure 2.12. A few logical identities
Figure 2.13. A few rules of inference
Figure 2.14. Tree structure of a simple formula
Figure 2.15. Tree structure of a more complex formula
Figure 2.16. Representation of a nominal group with the determinant a
Figure 2.17. The syntactic dependencies of the sentence: John looks at a flower
Figure 2.18. Analysis of a simple sentence with a transitive verb

3 The Sphere of Discourse and Text

Figure 3.1. Diagram of constant topic progression
Figure 3.2. Diagram of linear topic progression
Figure 3.3. Diagram of derived topic progression
Figure 3.4. Diagram of inserted topic progression
Figure 3.5. Prince’s taxonomy [PRI 81]
Figure 3.6. Linear segmentation of a text
Figure 3.7. Example of a text analyzed according to RST [MAN 12]. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 3.8. Example of rules used to construct the discursive structure starting from the syntactic structure [POL 04b]
Figure 3.9. DRT representation of [3.35]
Figure 3.10. DRT representation of [3.35]
Figure 3.11. First step of the algorithm
Figure 3.12. Second step of the algorithm
Figure 3.13. Third step of the algorithm
Figure 3.14. Fourth step of the algorithm
Figure 3.15. Fifth step of the algorithm
Figure 3.16. Sixth step of the algorithm
Figure 3.17. Seventh step of the algorithm
Figure 3.18. Algorithm for processing pronominal references [BRE 87]
Figure 3.19. Transition of centers in example [3.39]
Figure 3.20. Decision tree C4.5 for coreference resolution [MCC 95]
Figure 3.21. Basic algorithm for creating decision trees [QUI 79]
Figure 3.22. Skeleton of a decision tree

4 The Sphere of Applications

Figure 4.1. Examples of simple serial architectures. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.2. Architecture of the Hearsay II system
Figure 4.3. Levels of knowledge in the Hearsay II blackboard
Figure 4.4. Architecture of the Vico system [BER 02]. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.5. Architecture of the MICRO system [CAE 94]
Figure 4.6. Some software configurations for integrating syntax and semantics
Figure 4.7. Runtime of the worst cases observed by length
Figure 4.8. An example of an initial utterance and some derived utterances
Figure 4.9. Functional typology of MT systems according to Carbonell
Figure 4.10. General diagram of the assimilation process
Figure 4.11. General diagram of the dissemination process
Figure 4.12. The Vauquois triangle with some modifications [VAU 68]
Figure 4.13. Architecture of a system using the direct approach
Figure 4.14. Architecture of the transfer approach. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.15. Architecture of a transfer-based system with three languages
Figure 4.16. An example of a syntactic transfer
Figure 4.17. Syntactic transfer with order inversion
Figure 4.18. Example of a synchronized syntactic tree and semantic tree
Figure 4.19. Diagram of a simple transfer [PRI 94]
Figure 4.20. Architecture of pivot-based systems
Figure 4.21. Example of a pivot in the domain of the translation of hotel reservations
Figure 4.22. Diagram of a mediated dialogue in Verbmobil [WAH 00a]
Figure 4.23. The interface of the Verbmobil system [WAH 00a]. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.24. DFDs of information systems and information retrieval systems
Figure 4.25. Curve of the relation between frequency and rank for the first 300 words in the CSFN corpus. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.26. Flat clustering of a set of documents. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.27. Collection of documents
Figure 4.28. List of keywords retained in the documents in this collection
Figure 4.29. Graph of connections between the documents
Figure 4.30. Connected components algorithm
Figure 4.31. Diagrams of naive and complex Bayesian networks
Figure 4.32. General diagram of information retrieval with LSA. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.33. Application of the SVD to a matrix X
Figure 4.34. Collection of documents
Figure 4.35. Approximation factor of 2 of the SVD of X
Figure 4.36. Vector calculation of the queries and the documents in this collection
Figure 4.37. The vectors corresponding to the texts in reduced space (k = 2). For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.38. ELT approach for data warehouses. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.39. BD processing architecture. For a color version of this figure, see www.iste.co.uk/kurdi/language2.zip
Figure 4.40. DFD of an information extraction system and a question answering system
Figure 4.41. DFD of KnowItAll
Figure 4.42. Examples of predicates in the domains of geography and films
Figure 4.43. Induction of rules from examples
Figure 4.44. Dependency tree to extract relations
Figure 4.45. Algorithm to calculate tree similarity
Figure 4.46. Pairs of acquisition by multi-nationals
Figure 4.47. Examples of search results
Figure 4.48. DFD of the REES system
Figure 4.49. Example of an event template
Figure 4.50. The representation of emotions according to Russell’s circumplex model
Figure 4.51. Lövheim’s cube of emotions
Figure 4.52. DFD of a categorical system for extracting subjectivities

First published 2017 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd

27-37 St George’s Road

London SW19 4EU

www.iste.co.uk

John Wiley & Sons, Inc.

111 River Street

Hoboken, NJ 07030

USA

www.wiley.com

The rights of Mohamed Zakaria Kurdi to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2017953290

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-84821-921-2

Introduction

Language is a central tool in our social and professional lives. It is a means to convey ideas, information, opinions and emotions, as well as to persuade, request information, give orders, etc. Interest in language from a computer science point of view dates back to the beginnings of computer science itself, notably in the context of work on artificial intelligence. The Turing test, one of the first tests developed to determine whether a machine is intelligent, stipulates that to be considered intelligent, a machine must have conversational capacities comparable to those of a human [TUR 50]. This means that an intelligent machine must be capable of comprehension and generation, in the broad sense of the terms, hence the interest in natural language processing (NLP) at the dawn of the computer age. Historically, computer processing of languages was very quickly directed toward applied domains such as machine translation (MT) in the context of the Cold War. Thus, the first MT system was created as part of a joint project between Georgetown University and IBM in the United States [DOS 55, HUT 04]. This applied work was not as successful as intended, and researchers quickly became aware that a deep understanding of the linguistic system was a prerequisite for any successful application.

The Internet wave between the mid-1990s and the start of the 2000s was a very significant driving force for NLP and related domains, notably information retrieval, which grew from a marginal domain limited to retrieval within a large company into retrieval on the scale of the Internet, whose content is constantly growing. This growth in the availability of data also favored a discipline that was still in its infancy: Data Science. Located at the intersection of statistics, computer science and mathematics, Data Science focuses on the analysis, visualization and processing of digital data in all forms: images, text and speech. The role of NLP within Data Science is obvious, given that the majority of the information processed is contained in written documents or speech recordings. It is therefore possible to distinguish two different but complementary research approaches in the domain of NLP. On the one hand, there are works that aim to solve the fundamental problem of language processing and that are consequently concerned with the cognitive and linguistic aspects of this problem. On the other hand, several works are dedicated to optimizing and adapting existing NLP techniques for various applied domains such as the medical or banking sectors.

The objective of this book is to provide a comprehensive review of classic and modern works in the domains of lexical databases and the representation of knowledge for NLP, semantics, discourse analysis, and NLP applications such as machine translation and information retrieval. This book also aims to be profoundly interdisciplinary, giving equal consideration, as far as possible, to linguistic and cognitive models, algorithms and computer applications. We start from the premise, borne out time and time again in NLP and elsewhere, that the best results are the product of a good theory paired with a well-designed empirical approach.

In addition to the Introduction, this book has four chapters. The first chapter concerns the lexicon and the representation of knowledge. After an introduction to the principles of lexical semantics and theories of lexical meaning, this chapter covers lexical databases, the main procedures for representing knowledge and ontologies. The second chapter is dedicated to semantics. First, the main approaches in combinatorial semantics, such as interpretive semantics, generative semantics and case grammar, are presented. The next section is dedicated to the logical approaches to formal semantics used in the domain of NLP. The third chapter focuses on discourse. It covers the fundamental concepts in discourse analysis such as utterance production, thematic progression, structuring information in discourse, coherence and cohesion. This chapter also presents different approaches to discourse processing such as linear segmentation, discourse analysis and interpretation, and anaphora resolution. The fourth and final chapter is dedicated to NLP applications. First, the fundamental aspects of NLP systems such as software architecture and evaluation approaches are presented. Then, some particularly important applications in the domain of NLP, such as machine translation, information retrieval and information extraction, are reviewed.

Natural Language Processing and Computational Linguistics 2

Semantics, Discourse and Applications
