FOCUS SERIES
Series Editor Patrick Paroubek
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2016
The rights of Karën Fort to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016936602
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISSN 2051-2481 (Print)
ISSN 2051-249X (Online)
ISBN 978-1-84821-904-5
This book presents a unique opportunity for me to construct what I hope to be a consistent image of collaborative manual annotation for Natural Language Processing (NLP). I partly rely on work that has already been published elsewhere, with some of it only in French, most of it in reduced versions and all of it available on my personal website.1 Whenever possible, the original article should be cited in preference to this book.
I also refer to some publications in French. I retained them because no English equivalent exists, in the hope that at least some readers will be able to understand them.
This work owes a lot to my interactions with Adeline Nazarenko (LIPN/University of Paris 13) both during and after my PhD thesis. In addition, it would not have been conducted to its end without (a lot of) support and help from Benoît Habert (ICAR/ENS of Lyon).
Finally, I would like to thank all the friends who supported me in writing this book and proofread parts of it, as well as the colleagues who kindly accepted that their figures be part of it.
Natural Language Processing (NLP) has witnessed two major evolutions in the past 25 years: first, the extraordinary success of machine learning, which is now, for better or for worse (for an enlightening analysis of the phenomenon, see [CHU 11]), overwhelmingly dominant in the field, and second, the multiplication of evaluation campaigns and shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems (see Figure I.1).
These corpora progressively became the hidden pillars of our domain, providing food for our hungry machine learning algorithms and reference for evaluation. Annotation is now the place where linguistics hides in NLP.
However, manual annotation was largely ignored for quite a while, and it took some time even for annotation guidelines to be recognized as essential [NÉD 06]. When the performance of systems began to stall, manual annotation finally started to generate some interest in the community, as potential leverage for improving results [HOV 10, PUS 12].
This is all the more important as it has been shown that systems trained on badly annotated corpora underperform. In particular, they tend to reproduce annotation errors when these errors follow a regular pattern rather than corresponding to simple noise [REI 08]. Furthermore, the quality of manual annotation is crucial when it is used to evaluate NLP systems. For example, an inconsistently annotated reference corpus would undoubtedly favor machine learning systems, thereby penalizing rule-based systems in evaluation campaigns. Finally, linguistic analyses built on an unreliable annotated corpus would suffer accordingly.
Although some efforts have been made lately to address some of the issues presented by manual annotation, there is still little research done on the subject. This book aims at providing some (hopefully useful) insights into the subject. It is partly based on a PhD thesis [FOR 12a] and on some published articles, most of them written in French.
The renowned British corpus linguist Geoffrey Leech [LEE 97] defines corpus annotation as: “The practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. ‘Annotation’ can also refer to the end-product of this process”. This definition highlights the interpretative dimension of annotation but limits it to “linguistic information” and to some specific sources, without mentioning its goal.
In [HAB 05], Benoît Habert extends Leech’s definition, first, by not restricting the type of added information: “annotation consists of adding information (a stabilized interpretation) to language data: sounds, characters and gestures”.1 He adds that “it associates two or three steps: (i) segmentation to delimit fragments of data and/or add specific points; (ii) grouping of segments or points to assign them a category; (iii) (potentially) creating relations between fragments or points”.2
We build on these and provide a wider definition of annotation:
DEFINITION (Annotation).– Annotation covers both the process of adding a note on a source signal and the whole set of notes, or each note, that results from this process, without presuming a priori the nature of the source (text, video, image, etc.), the semantic content of the note (numbered note, value chosen from a reference list or free text), its position (global or local) or its objective (evaluation, characterization, simple comment).
Basically, annotating is adding a note to a source signal. The annotation is therefore the note, anchored in one point or in a segment of the source signal (see Figure I.2). In some cases, the span can be the whole document (for example, in indexing).
In the case of relations, two or more segments of the source signal are connected and a note is added to the connection. Often, a note is added to the segments too.
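The notions introduced above (a note anchored to a point or a segment of the signal, and relations connecting several segments) can be sketched as a small data model. This is an illustrative sketch only, not a format used in the book; all class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data model illustrating the definition above: an annotation
# is a note anchored to a span of the source signal; a relation connects
# two or more annotated segments and may carry a note of its own.

@dataclass
class Span:
    start: int  # offset where the anchor begins
    end: int    # offset where it ends (start == end for a single point)

@dataclass
class Annotation:
    span: Span
    note: str   # free text, or a value chosen from a reference list

@dataclass
class Relation:
    arguments: List[Annotation]  # the connected segments
    note: Optional[str] = None   # the note added to the connection itself

text = "Paris is the capital of France."
paris = Annotation(Span(0, 5), "LOCATION")
france = Annotation(Span(24, 30), "LOCATION")
capital_of = Relation([paris, france], "capital-of")

print(text[paris.span.start:paris.span.end])    # Paris
print(text[france.span.start:france.span.end])  # France
```

Note that the annotations here are standoff: they live outside the source text and point into it by offsets, so the signal itself is never modified.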
This definition of annotation includes many NLP applications, from transcription (the annotation of speech with its written interpretation) to machine translation (the annotation of one language with its translation in another language). However, the analysis we conduct here is mostly centered on categorization (adding a category, taken from a list, to a segment of signal or between segments of signal). This does not mean that the analysis cannot apply to transcription, for example, but we have not yet examined such applications thoroughly enough to claim that the research detailed in this book applies to them directly.
In NLP, annotations can either be added manually by a human interpreter or automatically by an analysis tool. In the first case, the interpretation can reflect parts of the subjectivity of its authors. In the second case, the interpretation is entirely determined by the knowledge and the algorithm integrated in the tool. We are focusing here on manual annotation as a task executed by human agents whom we call annotators.
Identifying the first evidence of annotation in history is impossible, but it seems likely that it appeared in the first writings on a physical support allowing for a text to be easily commented upon.
Annotations were used for private purposes (comments from readers) or public usage (explanations from professional readers). They were also used for communicating between writers (authors or copyists, i.e. professional readers) [BAK 10]. In these latter senses, the annotations had a collaborative dimension.
Early manuscripts contained glosses, i.e. according to the online Merriam-Webster dictionary:3 “a brief explanation (as in the margin or between the lines of a text) of a difficult or obscure word or expression”. Glosses were used to inform and train the reader. Other types of annotations were used, for example to update the text (apostils). The form of glosses could vary considerably and [MUZ 85, p. 134] distinguishes between nine different types. Interlinear glosses appear between the lines of a manuscript, marginal glosses in the margin, surrounding glosses in the circumference and separating glosses between the explained paragraphs. They are more or less merged into the text, from the organic gloss, which can be considered as part of the text, to the formal gloss, which constitutes a text in itself, transmitted from copy to copy (as today’s standoff annotations).4
Physical marks, like indention square brackets, could also be added to the text (see an example in a text by Virgil5 in Figure I.3), indicating the first commented words. Interestingly, this primitive anchoring did not indicate the end of the commented part.
The same limitation applies to the auctoritates, which appeared in the 8th century to credit the authors (considered authorities) of a citation. The anchoring of the annotation is marked by two dots above the first word of the citation, without indicating where it ends (see Figure I.4).
This delimitation problem was accentuated by errors made by copyists, who moved the auctoritates and their anchors. To solve this issue, and in the absence of other markers such as quotation marks (which would appear much later), introductory texts (pre- and peri-annotation) were invented.
From the content point of view, the evolution went from the explanatory gloss (free text) to the citation of authors (name of the author, from a limited list of authorities), precisely identified in the text. As for the anchoring, it improved progressively to look like our present text markup.
This rapid overview shows that many of today’s preoccupations – frontiers to delimit, annotations chosen freely or from a limited reference, anchoring, metadata to transmit – have been around for a while. It also illustrates the fact that adding an annotation is not a spontaneous gesture, but one which is reflected upon, studied, a gesture which is learned.
A good indicator of the rise in annotation diversity and complexity is the annotation language. The annotation language is the vocabulary used to annotate the flow of data. In a lot of annotation cases in NLP, this language is constrained.6 It can be of various types.
The simplest is the Boolean type. It covers annotation cases in which only one category is needed: a segment is either annotated with this category (which can even remain implicit) or not annotated at all. Experiments such as the identification of obsolescent segments in encyclopaedias [LAI 09] use this type of language.
Then come the first-order languages. Type languages are, for example, used for morpho-syntactic annotation without features (part-of-speech) or with features (morpho-syntax). The first case is in fact rather rare, since even when the tagset appears to have little structure, as in the Penn Treebank [SAN 90], features can almost always be deduced from it (for example, NNP, proper noun singular, and NNPS, proper noun plural, could be translated into NNP + Sg and NNP + Pl).
As for relations, a large variety are annotated in NLP today, from binary-oriented relations (for example, gene renaming relations [JOU 11]) to unoriented n-ary relations (for example, co-reference chains as presented in [POE 05]).
Finally, second-order languages could be used, for example, to annotate relations on relations. In the soccer domain, for example, intercept(pass(p1, p2), p3) represents a pass (relation) between two players (p1 and p2), which is intercepted by another player (p3). In practice, we simplify the annotation by adapting it to a first-order language by reifying the first relation [FOR 12b]. This is so commonly done that we are aware of no example of annotation using a second-order language.
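The reification described above can be sketched as follows. The representation and identifiers are hypothetical, chosen only to illustrate how intercept(pass(p1, p2), p3) becomes first-order: the inner pass relation is turned into an entity (an event identifier) that the intercept relation can then refer to.

```python
# Hypothetical sketch of reification: storing each relation as a fact with
# an identifier, so that a relation can take another relation as argument
# without leaving first-order representation.

relations = []

def reify(name, *args):
    """Store a relation as a fact and return an identifier for it."""
    rel_id = "e%d" % (len(relations) + 1)
    relations.append((rel_id, name, args))
    return rel_id

# Second-order: intercept(pass(p1, p2), p3)
# First-order equivalent after reifying the inner relation:
pass_event = reify("pass", "p1", "p2")                  # e1: pass(p1, p2)
intercept_event = reify("intercept", pass_event, "p3")  # e2: intercept(e1, p3)

print(relations)
# [('e1', 'pass', ('p1', 'p2')), ('e2', 'intercept', ('e1', 'p3'))]
```

Both facts are now ordinary first-order relations over entities, which is why annotation formats can stay first-order in practice.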
Jean Véronis concluded his 2000 state-of-the-art review of automatic annotation technology with a figure summarizing the situation [VÉR 00]. In this figure, only part-of-speech annotation and the multilingual alignment of sentences are considered "operational". Most applications are considered prototypes (prosody, partial syntax, multilingual word alignment), and the rest either did not yet allow for "applications which are useful in real situations" (full syntax, discourse semantics) or were close to the prototype stage (phonetic transcription, lexical semantics). The domain has evolved quickly, and today much more complex annotations can be performed, on different media and related to a large variety of phenomena.
In the past few years, we have witnessed the multiplication of annotation projects involving video sources, in particular sign language videos. A workshop on the subject (DEGELS) took place during the French NLP conference (TALN) in 2011 and 2012,7 and a training concerning video corpus annotation was organized by the Association pour le Traitement Automatique des LAngues (ATALA) in 2011.8
Moreover, more and more complex semantic annotations are now carried out on a regular basis, such as opinion or sentiment annotation. In the biomedical domain, the annotation of protein and gene names is now complemented by the annotation of relations, such as gene renaming [JOU 11] or relations between entities, in particular within the framework of the BioNLP shared tasks.9 Semantic annotations are also performed using a formal model (i.e. an ontology) [CIM 03], and linked data are now used to annotate corpora, as during the Biomedical Linked Annotation Hackathon (BLAH).10
Finally, annotations that are now considered as traditional, like named entities or anaphora, are getting significantly more complex, for example with added structuring [GRO 11].
However, there are still few corpora freely available with different levels of annotation, including annotations from different linguistic theories. MASC (Manually Annotated Sub-Corpus) [IDE 08]11 is an interesting exception, as it includes, among others, annotations of frames à la FrameNet [BAK 98] and senses à la WordNet [FEL 98]. In addition, we are not aware of any freely available multimedia-annotated corpus with each level of annotation aligned to the source, but such a corpus should not be long in coming.
The ever-growing complexity of annotation is taken into account in new annotation formats, like GrAF [IDE 07]; however, it still has to be integrated in the methodology and in the preparation of an annotation campaign.
The exact cost of an annotation campaign is rarely mentioned in research papers. One noteworthy exception is the Prague Dependency TreeBank, for which the authors of [BÖH 01] report a cost of US$600,000. Other articles detail the number of persons involved in the project they present: GENIA, for example, involved five part-time annotators, a senior coordinator and a junior coordinator for 1.5 years [KIM 08]. Anyone who has participated in such a project knows that manual annotation is very costly.
However, the resulting annotated corpora, when they are well-documented and available in a suitable format, as shown in [COH 05], are used well beyond and long after the training of the original model or the original research purpose. A typical example is the Penn TreeBank corpus, created in the early 1990s [MAR 93] and still used more than 20 years later (recent research such as [BOH 13] is easy to find). On the contrary, the tools trained on these corpora usually become quickly outdated as research progresses. An interesting example is that of the once successful PARTS tagger, created using the Brown corpus [CHU 88] and used to pre-annotate the Penn TreeBank. However, when the technology becomes mature and generates results that users consider satisfactory, the lifespan of such tools gets longer. This is the case in part-of-speech tagging for the TreeTagger [SCH 97], which, with nearly 96% accuracy for French [ALL 08], is still widely used, despite the fact that it is now less efficient than state-of-the-art systems (MElt [DEN 09], for example, obtains 98% accuracy on French). Such domains are still rare.
This trivial remark concerning the lifetime of corpora leads to important consequences with regard to the way we build manually annotated corpora.
First, it puts the cost of the manual work into perspective: a corpus that costs US$600,000 to build, like the Prague Dependency TreeBank, but is then used for more than 20 years, like the Penn TreeBank,12 is a sound investment.
Second, it is a strong argument against building manually annotated corpora according to the capabilities of the system(s) that will use them, as these systems will be long forgotten while the annotated corpus is still in use. If the corpus is too dependent on the systems' (limited) capabilities, it will no longer be useful once the algorithms become more efficient.
Third, this implies that manual annotation should be of high quality, i.e. well-prepared, well-documented and regularly evaluated with adequate metrics. Manual annotation campaign preparation is often rushed and overlooked, because people want to get it over with as quickly as possible.13 This has been particularly emphasized in [SAM 00], where the author notes (on p. 7) that: “[…] it seems to me that natural language computing has yet to take on board the software-engineering lesson of the primacy of problem analysis and documentation over coding”.
There is, in fact, a need for annotation engineering procedures and tools and this is what this book aims at providing, at least partly.