Cover Page

FOCUS SERIES

Series Editor Patrick Paroubek

Collaborative Annotation for Reliable Natural Language Processing

Technical and Sociological Aspects

Karën Fort


Preface

This book presents a unique opportunity for me to construct what I hope to be a consistent image of collaborative manual annotation for Natural Language Processing (NLP). I partly rely on work that has already been published elsewhere, with some of it only in French, most of it in reduced versions and all of it available on my personal website.1 Whenever possible, the original article should be cited in preference to this book.

I also refer to some publications in French. I retained these because no equivalent exists in English, in the hope that at least some readers will be able to understand them.

This work owes a lot to my interactions with Adeline Nazarenko (LIPN/University of Paris 13) both during and after my PhD thesis. In addition, it would not have been conducted to its end without (a lot of) support and help from Benoît Habert (ICAR/ENS of Lyon).

Finally, I would like to thank all the friends who supported me in writing this book and proofread parts of it, as well as the colleagues who kindly accepted that their figures be part of it.

List of Acronyms

ACE
Automatic Content Extraction
ACK
Annotation Collection Toolkit
ACL
Association for Computational Linguistics
AGTK
Annotation Graph Toolkit
API
Application Programming Interface
ATALA
Association pour le Traitement Automatique des LAngues (French Computational Linguistics Society)
HIT
Amazon Mechanical Turk Human Intelligence Task
LDC
Linguistic Data Consortium
NLP
Natural Language Processing
POS
Part-Of-Speech

Introduction

I.1. Natural Language Processing and manual annotation: Dr Jekyll and Mr Hyde?

I.1.1. Where linguistics hides

Natural Language Processing (NLP) has witnessed two major evolutions in the past 25 years: first, the extraordinary success of machine learning, which is now, for better or for worse (for an enlightening analysis of the phenomenon see [CHU 11]), overwhelmingly dominant in the field, and second, the multiplication of evaluation campaigns or shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems (see Figure I.1).

These corpora progressively became the hidden pillars of our domain, providing food for our hungry machine learning algorithms and reference for evaluation. Annotation is now the place where linguistics hides in NLP.

However, manual annotation was largely ignored for quite a while, and it took some time even for annotation guidelines to be recognized as essential [NÉD 06]. When the performance of systems began to stall, manual annotation finally started to generate some interest in the community, as potential leverage for improving the results obtained [HOV 10, PUS 12].

This is all the more important as it has been shown that systems trained on badly annotated corpora underperform. In particular, they tend to reproduce annotation errors when these errors follow a regular pattern and do not correspond to simple noise [REI 08]. Furthermore, the quality of manual annotation is crucial when it is used to evaluate NLP systems. For example, an inconsistently annotated reference corpus would undoubtedly favor machine learning systems, thereby penalizing rule-based systems in evaluation campaigns. Finally, the quality of linguistic analyses would suffer from an unreliable annotated corpus.


Figure I.1. Manually annotated corpora and machine learning process

Although some efforts have been made lately to address some of the issues presented by manual annotation, there is still little research done on the subject. This book aims at providing some (hopefully useful) insights into the subject. It is partly based on a PhD thesis [FOR 12a] and on some published articles, most of them written in French.

I.1.2. What is annotation?

The renowned British corpus linguist Geoffrey Leech [LEE 97] defines corpus annotation as: “The practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. ‘Annotation’ can also refer to the end-product of this process”. This definition highlights the interpretative dimension of annotation but limits it to “linguistic information” and to some specific sources, without mentioning its goal.

In [HAB 05], Benoît Habert extends Leech’s definition, first, by not restricting the type of added information: “annotation consists of adding information (a stabilized interpretation) to language data: sounds, characters and gestures”.1 He adds that “it associates two or three steps: (i) segmentation to delimit fragments of data and/or add specific points; (ii) grouping of segments or points to assign them a category; (iii) (potentially) creating relations between fragments or points”.2

We build on these and provide a wider definition of annotation:

DEFINITION (Annotation).– Annotation covers both the process of adding a note to a source signal and the whole set of notes, or each note, that results from this process, without presuming a priori the nature of the source (text, video, images, etc.), the semantic content of the note (numbered note, value chosen from a reference list or free text), its position (global or local) or its objective (evaluation, characterization, simple comment).

Basically, annotating is adding a note to a source signal. The annotation is therefore the note, anchored in one point or in a segment of the source signal (see Figure I.2). In some cases, the span can be the whole document (for example, in indexing).


Figure I.2. Anchoring of notes in the source signal

In the case of relations, two or more segments of the source signal are connected and a note is added to the connection. Often, a note is added to the segments too.
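As a minimal sketch of how this definition might be represented in practice, the following Python fragment models a note, its anchor and a relation. The class and field names are purely illustrative assumptions and are not taken from any existing annotation tool or format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    """Anchor of a note: a segment of the source signal.
    start == end can be used to encode a single anchoring point."""
    start: int
    end: int

@dataclass
class Annotation:
    """A note anchored in the source signal (or in the whole document)."""
    note: str               # category, free text, numeric value, etc.
    anchor: Optional[Span]  # None when the note applies to the whole document

@dataclass
class Relation:
    """A note added to a connection between two or more annotated segments."""
    note: str
    arguments: List[Annotation]

# Example: a category anchored in a segment of a textual source signal.
text = "Karen lives in Paris."
paris = Annotation(note="Location", anchor=Span(start=15, end=20))
```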

This definition of annotation includes many NLP applications, from transcription (the annotation of speech with its written interpretation) to machine translation (the annotation of one language with its translation in another language). However, the analysis we conduct here is mostly centered on categorization (adding a category taken from a list to a segment of signal or between segments of signal). It does not mean that it does not apply to transcription, for example, but we have not yet covered this thoroughly enough to be able to say that the research detailed in this book can directly apply to such applications.

In NLP, annotations can either be added manually by a human interpreter or automatically by an analysis tool. In the first case, the interpretation can reflect parts of the subjectivity of its authors. In the second case, the interpretation is entirely determined by the knowledge and the algorithm integrated in the tool. We are focusing here on manual annotation as a task executed by human agents whom we call annotators.

I.1.3. New forms, old issues

Identifying the first evidence of annotation in history is impossible, but it seems likely that it appeared in the first writings on a physical support allowing for a text to be easily commented upon.

Annotations were used for private purposes (comments from readers) or for public use (explanations from professional readers). They were also used for communication between writers (authors or copyists, i.e. professional readers) [BAK 10]. In the latter cases, annotations had a collaborative dimension.

Early manuscripts contained glosses, i.e. according to the online Merriam-Webster dictionary:3 “a brief explanation (as in the margin or between the lines of a text) of a difficult or obscure word or expression”. Glosses were used to inform and train the reader. Other types of annotations were used, for example, to update the text (apostils). The form of glosses could vary considerably and [MUZ 85, p. 134] distinguishes between nine different types. Interlinear glosses appear between the lines of a manuscript, marginal glosses in the margin, surrounding glosses around the text and separating glosses between the explained paragraphs. They are more or less merged into the text, from the organic gloss, which can be considered part of the text, to the formal gloss, which constitutes a text in itself, transmitted from copy to copy (like today’s standoff annotations).4

Physical marks, like indention square brackets, could also be added to the text (see an example in a text by Virgil5 in Figure I.3), indicating the first commented words. Interestingly, this primitive anchoring did not indicate the end of the commented part.


Figure I.3. Indention square brackets in a text by Virgil. Bibliothèque municipale de Lyon (Courtesy of Town Library of Lyon), France, Res. 104 950

The same limitation applies to the auctoritates, which appeared in the 8th Century to identify the authors (considered as authorities) of a citation. The anchoring of the annotation is marked by two dots above the first word of the citation, without indicating where it ends (see Figure I.4).


Figure I.4. Anchoring of auctoritates in De sancta Trinitate, Basel, UB B.IX.5, extracted from [FRU 12], by courtesy of the authors

This delimitation problem was accentuated by the errors made by the copyists, who moved the auctoritates and their anchors. To solve this issue, and in the absence of other markers like quotation marks (which would appear much later), introductory texts (pre- and peri-annotation) were invented.

From the content point of view, the evolution went from the explanatory gloss (free text) to the citation of authors (name of the author, from a limited list of authorities), precisely identified in the text. As for the anchoring, it improved progressively to look like our present text markup.

This rapid overview shows that many of today’s preoccupations – boundaries to delimit, annotations chosen freely or from a limited reference, anchoring, metadata to transmit – have been around for a while. It also illustrates the fact that adding an annotation is not a spontaneous gesture, but a deliberate one, which is reflected upon, studied and learned.

I.2. Rediscovering annotation

I.2.1. A rise in diversity and complexity

A good indicator of the rise in annotation diversity and complexity is the annotation language. The annotation language is the vocabulary used to annotate the flow of data. In a lot of annotation cases in NLP, this language is constrained.6 It can be of various types.

The simplest is the Boolean type. It covers annotation cases in which only one category is needed: a segment is either annotated with this category (which can therefore remain implicit) or not annotated at all. Experiments like the identification of obsolescent segments in encyclopedias [LAI 09] use this type of language.

Then come first-order languages. Type languages are used, for example, for morpho-syntactic annotation without features (part-of-speech) or with features (morpho-syntax). The first case is in fact rather rare: even if a tagset seems to have little structure, as in the Penn Treebank [SAN 90], features can almost always be deduced from it (for example, NNP, proper noun singular, and NNPS, proper noun plural, could be translated into NNP + Sg and NNP + Pl).
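As a purely illustrative sketch, such a deduction can be expressed as a simple lookup; the tag-to-feature mapping below is a hypothetical example covering only a handful of Penn Treebank-style tags.

```python
# Illustrative mapping: decompose a few Penn Treebank-style tags
# into a base category plus features (NNP -> NNP + Sg, NNPS -> NNP + Pl).
TAG_FEATURES = {
    "NNP":  ("NNP", {"number": "Sg"}),  # proper noun, singular
    "NNPS": ("NNP", {"number": "Pl"}),  # proper noun, plural
    "NN":   ("NN",  {"number": "Sg"}),  # common noun, singular
    "NNS":  ("NN",  {"number": "Pl"}),  # common noun, plural
}

def decompose(tag: str):
    """Return (base category, features); unknown tags are left unanalyzed."""
    return TAG_FEATURES.get(tag, (tag, {}))

print(decompose("NNPS"))  # ('NNP', {'number': 'Pl'})
```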

As for relations, a large variety is annotated in NLP today, from oriented binary relations (for example, gene renaming relations [JOU 11]) to unoriented n-ary relations (for example, co-reference chains as presented in [POE 05]).

Finally, second-order languages could be used, for example, to annotate relations on relations. In the soccer domain, for example, intercept(pass(p1, p2), p3) represents a pass (relation) between two players (p1 and p2), which is intercepted by another player (p3). In practice, we simplify the annotation by adapting it to a first-order language by reifying the first relation [FOR 12b]. This is so commonly done that we are aware of no example of annotation using a second-order language.
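The following sketch illustrates the reification idea on the soccer example; the identifiers and dictionary layout are hypothetical and do not reproduce the actual scheme of [FOR 12b]. The pass relation becomes an annotation unit of its own, so that the interception can be expressed as an ordinary first-order relation pointing to it.

```python
# Hypothetical identifiers and layout, for illustration only.
annotations = {
    # Player mentions annotated in the source signal.
    "T1": {"type": "Player", "text": "p1"},
    "T2": {"type": "Player", "text": "p2"},
    "T3": {"type": "Player", "text": "p3"},
    # The pass between p1 and p2 is reified as an annotation unit...
    "E1": {"type": "Pass", "args": ["T1", "T2"]},
    # ...so that the interception becomes an ordinary binary relation.
    "R1": {"type": "Intercept", "args": ["E1", "T3"]},
}
```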

Jean Véronis concluded his state of the art of automatic annotation technology in 2000 with a figure summarizing the situation [VÉR 00]. In this figure, only part-of-speech annotation and the multilingual alignment of sentences are considered “operational”. Most applications are considered prototypes (prosody, partial syntax, multilingual word alignment), while the rest either did not yet allow for “applications which are useful in real situations” (full syntax, discourse semantics) or were close to prototypes (phonetic transcription, lexical semantics). The domain has evolved quickly, and today much more complex annotations can be performed on different media and related to a large variety of phenomena.

In the past few years, we have witnessed the multiplication of annotation projects involving video sources, in particular sign language videos. A workshop on the subject (DEGELS) took place during the French NLP conference (TALN) in 2011 and 2012,7 and a training session on video corpus annotation was organized by the Association pour le Traitement Automatique des LAngues (ATALA) in 2011.8

Moreover, more and more complex semantic annotations are now carried out on a regular basis, like opinion or sentiment annotation. In the biomedical domain, the annotation of protein and gene names is now complemented by the annotation of relations, like gene renaming [JOU 11] or relations between entities, in particular within the framework of the BioNLP shared tasks.9 Semantic annotations are also performed using a formal model (i.e. an ontology) [CIM 03], and linked data are now used to annotate corpora, as during the Biomedical Linked Annotation Hackathon (BLAH).10

Finally, annotations that are now considered as traditional, like named entities or anaphora, are getting significantly more complex, for example with added structuring [GRO 11].

However, there are still few freely available corpora with different levels of annotation, including annotations from different linguistic theories. MASC (Manually Annotated Sub-Corpus) [IDE 08]11 is an interesting exception, as it includes, among others, annotations of frames à la FrameNet [BAK 98] and senses à la WordNet [FEL 98]. Besides, we are not aware of any freely available multimedia-annotated corpus with each level of annotation aligned to the source, but it should not be long before one is developed.

The ever-growing complexity of annotation is taken into account in new annotation formats, like GrAF [IDE 07]; however, it still has to be integrated in the methodology and in the preparation of an annotation campaign.

I.2.2. Redefining manual annotation costs

The exact cost of an annotation campaign is rarely mentioned in research papers. One noteworthy exception is the Prague Dependency TreeBank, for which the authors of [BÖH 01] announce a cost of US$600,000. Other articles detail the number of persons involved in the project they present: GENIA, for example, involved 5 part-time annotators, a senior coordinator and one junior coordinator for 1.5 years [KIM 08]. Anyone who has participated in such a project knows that manual annotation is very costly.

However, the resulting annotated corpora, when they are well-documented and available in a suitable format, as shown in [COH 05], are used well beyond and long after the training of the original model or the original research purpose. A typical example is the Penn TreeBank corpus, created in the early 1990s [MAR 93] and still used more than 20 years later (it is easy to find recent research like [BOH 13]). By contrast, the tools trained on these corpora usually become outdated quickly as research progresses. An interesting example is that of the once successful PARTS tagger, created using the Brown corpus [CHU 88] and used to pre-annotate the Penn TreeBank. However, when the technology matures and generates results that users consider satisfactory, the lifespan of such tools gets longer. This is the case, for example, in part-of-speech tagging for the TreeTagger [SCH 97], which, with nearly 96% accuracy for French [ALL 08], is still widely used, despite the fact that it is now less accurate than state-of-the-art systems (MElt [DEN 09], for example, obtains 98% accuracy on French). Such domains are still rare.

This trivial remark concerning the lifetime of corpora leads to important consequences with regard to the way we build manually annotated corpora.

First, it puts the cost of the manual work into perspective: a manually annotated corpus costing US$600,000, like the Prague Dependency TreeBank, does not seem so expensive if it is used for more than 20 years, like the Penn TreeBank.12

Second, it is a strong argument against building manually annotated corpora according to the capabilities of the system(s) that will use them, as those systems will be long forgotten while the annotated corpus is still being used. If the corpus is too dependent on the systems’ (limited) capabilities, it will no longer be useful when the algorithms become more efficient.

Third, this implies that manual annotation should be of high quality, i.e. well-prepared, well-documented and regularly evaluated with adequate metrics. Manual annotation campaign preparation is often rushed and overlooked, because people want to get it over with as quickly as possible.13 This has been particularly emphasized in [SAM 00], where the author notes (on p. 7) that: “[…] it seems to me that natural language computing has yet to take on board the software-engineering lesson of the primacy of problem analysis and documentation over coding”.

There is, in fact, a need for annotation engineering procedures and tools, and this is what this book aims at providing, at least in part.