Acknowledgments

I would like to thank the University of Franche-Comté and my colleagues in the ELLIADD laboratory for believing in the NooJ project and supporting the community of NooJ users unfailingly since its inception.

It would be impossible for me to mention every single one of the colleagues and students who have participated, in one way or another, in the extremely ambitious project described in this book – that of formalizing natural languages! The NooJ software has been in use since 2002 by a community of researchers and students; see www.nooj4nlp.net. NooJ was developed in direct cooperation with all its users who devoted their energy to this or that specific problem, or to one language or another. Spelling in Semitic languages, variation in Asian languages, intonation in Armenian, inflection in Hungarian, phrasal verbs in English, derivation in Slavic languages, composition in Greek and in Germanic languages, etc. pose a wide variety of linguistic problems, and without the high standards of these linguists the NooJ project would never have known the success it is experiencing today. Very often, linguistic questions that seemed “trivial” at the time have had a profound influence on the development of NooJ.

Among its users, there are some “NooJ experts” to whom I would like to give particular thanks, as they participated directly in its design, and had the patience to help me with long debugging sessions. I thank them for their ambition and their patience: Héla Fehri, Kristina Kocijan, Slim Mesfar, Cristina Mota, and Simonetta Vietri.

I would also like to thank Danielle Leeman and François Trouilleux for their detailed review of the original book, and Peter Machonis for his review of the English version, as well as for verifying the relevance of the English examples, which contributed greatly to the quality of this book.

Max SILBERZTEIN

November, 2015.

For Nadia “Nooj” Malinovich Silberztein, the Mensch of the family, without whom neither this book, nor the project named after her, would have happened.

And for my two children, Avram and Rosa, who remind me every day of the priorities in my life.

Series Editor

Patrick Paroubek

Formalizing Natural Languages

The NooJ Approach

Max Silberztein

Bibliography

  1. [AHO 03] AHO A., LAM M., SETHI R. et al., Compilers: Principles, Techniques, and Tools, 2nd ed., Addison Wesley, 2006.
  2. [ALL 07] ALLAUZEN C., RILEY M., SCHALKWYK J., “OpenFst: a general and efficient weighted finite-state transducer library”, Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA), vol. 4783, pp. 11–23, 2007.
  3. [AME 11] American Heritage Dictionary of the English Language, 5th ed., Houghton Mifflin Company, Boston, 2011.
  4. [AOU 07] AOUGHLIS F., “A computer science dictionary for NooJ”, Lecture Notes in Computer Science, Springer-Verlag, vol. 4592, pp. 341–351, 2007.
  5. [BAC 59] BACKUS J., “The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM conference”, Proceedings of the International Conference on Information Processing, UNESCO, pp. 125–132, 1959.
  6. [BAL 02] BALDRIDGE J., Lexically Specified Derivational Control in Combinatory Categorial Grammar, PhD Dissertation, University of Edinburgh, 2002.
  7. [BAR 08] BARREIRO A., “ParaMT: a paraphraser for machine translation”, Lecture Notes in Computer Science, Springer-Verlag, vol. 5190, pp. 202–211, 2008.
  8. [BAR 14] BARREIRO A., BATISTA F., RIBEIRO R. et al., “OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries”, Proceedings of the 9th edition of the LREC Conference, 2014.
  9. [BEN 15] BEN A., FEHRI H., BEN H., “Translating Arabic relative clauses into English using NooJ”, Formalising Natural Languages with NooJ 2014, Cambridge Scholars Publishing, Newcastle, 2015.
  10. [BEN 10] BEN H., PITON O., FEHRI H., “Recognition and Arabic-French translation of named entities: case of the sport places”, Finite-State Language Engineering with NooJ: Selected Papers from the NooJ 2009 International Conference, Sfax University Press, Tunisia, 2010.
  11. [BER 60] BERNER R., “A proposal for character code compatibility”, Communications of the ACM, vol. 3, no. 2, pp. 71–72, 1960.
  12. [BIN 90] BINYONG Y., FELLEY M., Chinese Romanization: Pronunciation and Orthography, Sinolingua, Peking, 1990.
  13. [BLA 90] BLAKE B., Relational Grammar, Routledge, London, 1990.
  14. [BLO 33] BLOOMFIELD L., Language, Henry Holt, New York, 1933.
  15. [BÖG 07] BÖGEL T., BUTT M., HAUTLI A. et al., “Developing a finite-state morphological analyzer for Urdu and Hindi: some issues”, Proceedings of FSMNLP07, Potsdam, Germany, 2007.
  16. [BRI 92] BRILL E., “A simple rule-based part of speech tagger”, Proceedings of the ANLC’92 3rd Conference on Applied Natural Language Processing, Association for Computational Linguistics, Stroudsburg, PA, 1992.
  17. [BRU 02] BRUNSTEIN B., “Annotation guidelines for answer types”, Linguistic Data Consortium, Philadelphia, 2002.
  18. [DON 13] DONNELLY C., STALLMAN R., GNU Bison-The Yacc-Compatible Parser Generator: Bison Version 2.7, FSF, p. 201, 2013.
  19. [CHA 97] CHARNIAK E., “Statistical techniques for natural language parsing”, AI Magazine, vol. 18, no. 4, p. 33, 1997.
  20. [CHO 57] CHOMSKY N., Syntactic Structures, Mouton, The Hague, 1957.
  21. [CHR 92] CHRISTIANSEN M., “The (non) necessity of recursion in natural language processing”, Proceedings of the 14th Annual Conference of the Cognitive Science Society, Cognitive Science Society, Indiana University, pp. 665–670, 1992.
  22. [CHR 99] CHROBOT A., COURTOIS B., HAMMANI-MCCARTHY M. et al., Dictionnaire électronique DELAC anglais: noms composés, Technical Report 59, LADL, Université Paris 7, 1999.
  23. [COU 90a] COURTOIS B., “Un système de dictionnaires électroniques pour les mots simples du français”, in COURTOIS B., SILBERZTEIN M. (eds), Dictionnaires électroniques du français, Larousse, Paris, pp. 5–10, 1990.
  24. [COU 90b] COURTOIS B., SILBERZTEIN M., Les dictionnaires électroniques, Langue française no. 87, Larousse, Paris, 1990.
  25. [DAL 95] DALRYMPLE M., KAPLAN R., MAXWELL J. et al., Formal Issues in Lexical-Functional Grammar, CSLI Publications, Stanford, 1995.
  26. [DAN 85] DANLOS L., Génération automatique de textes en langue naturelle, Masson, Paris, 1985.
  27. [DON 07] DONABEDIAN B., “La lemmatisation de l’arménien occidental avec NooJ”, Formaliser les langues avec l’ordinateur: De INTEX à NooJ, Les cahiers de la MSH Ledoux, Presses universitaires de Franche-Comté, pp. 55–75, 2007.
  28. [DON 13] DONNELLY Ch. S. R., The Bison Manual, https://jdcqivvcr.updog.co/amRjcWl2dmNyMTg4MjExNDIzWA.pdf, 2013.
  29. [DUB 97] DUBOIS J., DUBOIS-CHARLIER F., Les verbes français, Larousse, Paris, 1997.
  30. [DUB 10] DUBOIS J., DUBOIS-CHARLIER F., “La combinatoire lexico-syntaxique dans le Dictionnaire électronique des mots”, Langages, vol. 3, pp. 31–56, 2010.
  31. [DUR 14] DURAN M., “Formalising Quechua Verb Inflection”, Formalising Natural Languages with NooJ 2013: Selected Papers from the NooJ 2013 International Conference (Saarbrucken, Germany), Cambridge Scholars Publishing, Newcastle, 2014.
  32. [EIL 74] EILENBERG S., Automata, Languages and Machines, Academic Press, New York, 1974.
  33. [EVE 95] EVERAERT M., VAN DER LINDEN E.-J., SCHENK A. et al. (eds), Idioms: Structural and psychological perspectives, Erlbaum, Hillsdale, NJ, 1995.
  34. [FEH 10] FEHRI H., HADDAR K., BEN H., “Integration of a transliteration process into an automatic translation system for named entities from Arabic to French”, Proceedings of the NooJ 2009 International Conference and Workshop, Sfax, University Press, pp. 285–300, 2010.
  35. [FEL 14] FELLBAUM C., “WordNet: an electronic lexical resource for English”, in CHIPMAN S. (ed.), The Oxford Handbook of Cognitive Science, Oxford University Press, New York, 2014.
  36. [FIL 08] FILLMORE C., “A valency dictionary of English”, International Journal of Lexicography Advance Access, October 8, 2008.
  37. [FRE 85] FRECKLETON P., Sentence idioms in English, Working Papers in Linguistics 11, University of Melbourne, 1985.
  38. [FRI 03] FRIEDERICI A., KOTZ S., “The brain basis of syntactic processes: functional imaging and lesion studies”, Neuroimage, vol. 20, no. 1, pp. S8–S17, 2003.
  39. [GAZ 85] GAZDAR G., KLEIN E., PULLUM G. et al., Generalized Phrase Structure Grammar, Blackwell, Oxford, and Harvard University Press, Cambridge, MA, 1985.
  40. [GAZ 88] GAZDAR G., “Applicability of Indexed Grammars to Natural Languages”, in REYLE U., ROHRER C. (eds), Natural Language Parsing and Linguistic Theories, Studies in Linguistics and Philosophy 35, D. Reidel Publishing Company, pp. 69–94, 1988.
  41. [GRA 02] GRASS T., MAUREL D., PITON O., “Description of a multilingual database of proper names”, Lecture Notes in Computer Science, vol. 2389, pp. 31–36, 2002.
  42. [GRE 11] GREENEMEIER L., “Say what? Google works to improve YouTube autocaptions for the deaf”, Scientific American, 23rd June 2011.
  43. [GRO 68] GROSS M., Grammaire transformationnelle du français, 1: le verbe, Larousse, Paris, 1968.
  44. [GRO 75] GROSS M., Méthodes en syntaxe, Hermann, Paris, 1975.
  45. [GRO 77] GROSS M., Grammaire transformationnelle du français, 2: syntaxe du nom, Larousse, Paris, 1977.
  46. [GRO 86] GROSS M., Grammaire transformationnelle du français, 3: syntaxe de l’adverbe, Cantilène, Paris, 1986.
  47. [GRO 94] GROSS M., “Constructing lexicon-grammars”, Computational Approaches to the Lexicon, Oxford University Press, pp. 213–263, 1994.
  48. [GRO 96] GROSS M., “Lexicon Grammar”, in BROWN K., MILLER J. (eds), Concise Encyclopedia of Syntactic Theories, Elsevier, New York, pp. 244–258, 1996.
  49. [HAL 94] HALLIDAY M., Introduction to Functional Grammar, 2nd edition, Edward Arnold, London, 1994.
  50. [HAR 70] HARRIS Z., Papers in Structural and Transformational Linguistics, Springer Science and Business Media, Dordrecht, 1970.
  51. [HAR 02] HARALAMBOUS Y., “Unicode et typographie: un amour impossible”, Document Numérique, vol. 6, no. 3, pp. 105–137, 2002.
  52. [HER 04] HERBST T., HEATH D., ROE I. et al. (eds), A Valency Dictionary of English: A Corpus-Based Analysis of the Complementation Patterns of English Verbs, Nouns and Adjectives, Mouton de Gruyter, Berlin, 2004.
  53. [HO 78] HO S.H., “An analysis of the two Chinese radical systems”, Journal of the Chinese Language Teachers Association, vol. 13, no. 2, pp. 95–109, 1978.
  54. [HOB 99] HOBBS A., Five-Unit Codes, The North American Data Communications Museum, Sandy Hook, CT, available at: www.nadcomm.com/fiveunit/fiveunits.htm, 1999.
  55. [HOP 79] HOPCROFT J., ULLMAN J., Introduction to Automata Theory, Languages and Computation, Addison-Wesley Publishing, Reading Massachusetts, 1979.
  56. [ITT 07] ITTYCHERIAH A., ROUKOS S., “IBM’s statistical question answering system”, TREC-11 Proceedings, NIST Special Publication, available at: trec.nist.gov/pubs.html, 2007.
  57. [JOH 74] JOHNSON D., Toward a Theory of Relationally-Based Grammar, Garland Publishing, New York, 1974.
  58. [JOH 12] JOHNSON S.C., Yacc: Yet Another Compiler-Compiler, AT&T Bell Laboratories, Murray Hill, NJ, Nov. 2012.
  59. [JOS 87] JOSHI A., “An introduction to tree adjoining grammars”, in MANASTER-RAMER A. (ed.), Mathematics of Language, John Benjamins, Amsterdam, pp. 87–114, 1987.
  60. [JUR 00] JURAFSKY D., MARTIN J., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, New York, 2000.
  61. [KAP 82] KAPLAN R., BRESNAN J., “Lexical-functional grammar: a formal system for grammatical representation”, in BRESNAN J. (ed.), The Mental Representation of Grammatical Relations, pp. 173–281, MIT Press, Cambridge, 1982.
  62. [KAR 97] KARTTUNEN L., TAMÁS G., KEMPE A., Xerox finite-state tool, Technical report, Xerox Research Centre Europe, 1997.
  63. [KAR 07] KARLSSON F., “Constraints on multiple center-embedding of clauses”, Journal of Linguistics, vol. 43, no. 2, pp. 365–392, 2007.
  64. [KLA 91] KLARSFELD G., HAMMANI-MCCARTHY M., Dictionnaire électronique du LADL pour les mots simples de l’anglais, Technical report, LADL, Université Paris 7, 1991.
  65. [KLE 56] KLEENE S.C., “Representation of events in nerve nets and finite automata”, Automata Studies, Annals of Mathematics Studies, vol. 34, pp. 3–41, 1956.
  66. [KÜB 02] KÜBLER N., “Creating a term base to customise an MT system: reusability of resources and tools from the translator’s point of view”, Proceedings of the 1st International Workshop on Language Resources in Translation Work and Research (LREC), Las Palmas de Gran Canaria, pp. 44–48, 2002.
  67. [KUP 08] KUPŚĆ A., ABEILLÉ A., “Growing TreeLex”, Computational Linguistics and Intelligent Text Processing, vol. 4919, pp. 28–39, 2008.
  68. [LEC 98] LECLÈRE C., “Travaux récents en lexique-grammaire”, Travaux de linguistique, vol. 37, pp. 155–186, 1998.
  69. [LEC 05] LECLÈRE C., “The lexicon-grammar of French verbs: a syntactic database”, in KAWAGUCHI Y., ZAIMA S., TAKAGAKI et al. (eds.), Linguistic Informatics – State of the Art and the Future, pp. 29–45, Benjamins, Amsterdam/Philadelphia, 2005.
  70. [LEE 90] LEEMAN D., MELEUC S., “Verbes en table et adjectifs en –able”, in COURTOIS B., SILBERZTEIN M. (eds), Dictionnaires électroniques du français, Larousse, Paris, pp. 30–51, 1990.
  71. [LEV 93] LEVIN B., English Verb Classes and Alternations, The University of Chicago Press, Chicago, 1993.
  72. [LIN 08] LIN H.C., “Treatment of Chinese orthographical and lexical variants with NooJ”, in BLANCO X., SILBERZTEIN M. (eds), Proceedings of the 2007 International NooJ Conference, pp. 139–148, Cambridge Scholars Publishing, Cambridge, 2008.
  73. [LIN 10] LINDÉN K., SILFVERBERG M., PIRINEN T., HFST Tools for Morphology: An Efficient Open-Source Package for Construction of Morphological Analysers, University of Helsinki, Finland, 2010.
  74. [MCC 03] MCCARTHY D., KELLER B., CARROLL J., “Detecting a continuum of compositionality in phrasal verbs”, Proceedings of the ACL 2003 Workshop on Multiword Expressions, 2003.
  75. [MAC 10] MACHONIS P., “English phrasal verbs: from Lexicon-Grammar to Natural Language Processing”, Southern Journal of Linguistics, vol. 34, no. 1, pp. 21–48, 2010.
  76. [MAC 12] MACHONIS P., “Sorting NooJ out to take multiword expressions into account”, in VUČKOVIĆ K. et al. (ed.), Proceedings of the NooJ 2011 Conference, pp. 152–165, Cambridge Scholars Publishing, Newcastle, 2012.
  77. [MAC 94] MACLEOD C., GRISHMAN R., MEYERS A., “Creating a Common Syntactic Dictionary of English”, Proceedings of the International Workshop on Shareable Natural Language Resources, Nara, Japan, August 10–11, 1994.
  78. [MAC 04] MACLEOD C., GRISHMAN R., MEYERS A. et al., “The NomBank Project: an interim report”, HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotations, 2004.
  79. [MAN 99] MANNING C., SCHÜTZE H., Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, 1999.
  80. [MAN 01] MANI I. (ed.), Automatic Summarization, John Benjamins Publishing, Amsterdam, Philadelphia, 2001.
  81. [MAR 93] MARCUS M., SANTORINI B., MARCINKIEWICZ M., “Building a large annotated corpus of English: the Penn Treebank”, Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
  82. [MCI 81] MCILROY D., “Development of a spelling list”, IEEE Transactions on Communications, vol. 30, no. 1, pp. 91–99, 1981.
  83. [MEL 87] MEL’ČUK I., Dependency Syntax: Theory and Practice, State University of New York Press, Albany, 1987.
  84. [MES 08a] MESFAR S., Analyse morpho-syntaxique automatique et reconnaissance des entités nommées en arabe standard, Thesis, University of Franche-Comté, 2008.
  85. [MES 08b] MESFAR S., SILBERZTEIN M., “Transducer minimization and information compression for NooJ dictionaries”, Proceedings of the FSMNLP 2008 Conference, Frontiers in Artificial Intelligence and Applications, IOS Press, The Netherlands, 2008.
  86. [MOG 08] MOGORRON P., MEJRI S., Las construcciones verbo-nominales libres y fijas, available at: halshs.archives-ouvertes.fr/halshs-00410995, 2008.
  87. [MON 14] MONTELEONE M., VIETRI S., “The NooJ English Dictionary”, in KOEVA S., MESFAR S., SILBERZTEIN M. (eds.), Formalising Natural Languages with NooJ 2013: Selected Papers from the NooJ 2013 International Conference, Cambridge Scholars Publishing, Newcastle, UK, 2014.
  88. [MOO 56] MOORE E., “Gedanken experiments on sequential machines”, Automata studies, Annals of mathematics studies, vol. 32, pp. 129–153, Princeton University Press, 1956.
  89. [MOO 00] MOORE R.C., “Removing left recursion from context-free grammars”, 6th Applied Natural Language Processing Conference / Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics Conference, Association for Computational Linguistics, pp. 249–255, 2000.
  90. [MOO 97] MOORTGAT M., “Categorial type logics”, in VAN BENTHEM J., MEULEN T. (eds), Handbook of Logic and Language, Elsevier, pp. 93–178, 1997.
  91. [NUN 94] NUNBERG G., SAG I., WASOW T., “Idioms”, Language, vol. 70, pp. 491–538, 1994.
  92. [POL 84] POLLARD C., Generalized Phrase Structure Grammars, Head Grammars, and Natural Language, Ph.D. thesis, Stanford University, 1984.
  93. [POL 94] POLLARD C., SAG I., Head-Driven Phrase Structure Grammar, University of Chicago Press, Chicago, 1994.
  94. [RAY 06] RAYNER M., HOCKEY B., BOUILLON P., Putting Linguistics into Speech Recognition: The Regulus Grammar Compiler, CSLI Publications, Stanford, 2006.
  95. [RHE 88] RHEINGOLD H., They Have a Word for It: A Lighthearted Lexicon of Untranslatable Words and Phrases, Jeremy P. Tarcher Inc., Los Angeles, 1988.
  96. [ROC 97] ROCHE E., SCHABES Y. (eds), Finite-State Language Processing, MIT Press, Cambridge, MA, 1997.
  97. [ROU 06] ROUX M., EL ZANT M., ROYAUTÉ J., “Projet Epidemia, intervention des transducteurs NooJ”, Actes des 9èmes journées scientifiques INTEX/NooJ, Belgrade, 1–3 June 2006.
  98. [SAB 13] SABATIER P., LE PESANT D., “Les dictionnaires électroniques de Jean Dubois et Françoise Dubois-Charlier et leur exploitation en TAL”, in GALA N., ZOCK M. (eds), Ressources Lexicales, Linguisticae Investigationes Supplementa 30, John Benjamins Publishing Company, Amsterdam, 2013.
  99. [SAG 02] SAG I., BALDWIN T., BOND F. et al., “Multiword Expressions: A Pain in the Neck for NLP”, in Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pp. 1–15, Mexico City, 2002.
  100. [SAL 83] SALKOFF M., “Bees are swarming in the garden: a systematic synchronic study of productivity”, Language, vol. 59, no. 2, 1983.
  101. [SAL 99] SALKOFF M., A French-English Grammar: A Contrastive Grammar on Translation Principles, John Benjamins, Amsterdam, 1999.
  102. [SAL 04] SALKOFF M., “Verbs of mental states”, in Lexique, syntaxe et lexique-grammaire. Papers in honour of Maurice Gross, volume 24 of Lingvisticæ Investigationes Supplementa, pp. 561–571, Benjamins, Amsterdam/Philadelphia, 2004.
  103. [SAU 16] SAUSSURE F., Cours de linguistique générale, Payot, Paris, 1916.
  104. [SCH 05] SCHMID H., “A programming language for finite-state transducers”, Proceedings of the 5th International Workshop on Finite State Methods in Natural Language Processing (FSMNLP), Helsinki, Finland, 2005.
  105. [SIL 87] SILBERZTEIN M., “The lexical analysis of French”, Electronic Dictionaries and Automata in Computational Linguistics, vol. 377, pp. 93–110, 1987.
  106. [SIL 90] SILBERZTEIN M., “Le dictionnaire électronique des mots composés”, in COURTOIS B., SILBERZTEIN M. (eds), Dictionnaires électroniques du français, Larousse, Paris, pp. 11–22, 1990.
  107. [SIL 93a] SILBERZTEIN M., Dictionnaires électroniques et analyse automatique de textes: le système INTEX, Masson, Paris, 1993.
  108. [SIL 93b] SILBERZTEIN M., “Groupes nominaux libres et noms composés lexicalisés”, Linguisticae Investigationes, vol. XVII, no. 2, pp. 405–425, 1993.
  109. [SIL 95] SILBERZTEIN M., “Dictionnaires électroniques et comptage des mots”, 3es Journées internationales d’analyse statistique des données textuelles (JADT), Rome, 1995.
  110. [SIL 03a] SILBERZTEIN M., NooJ Manual, available at: www.nooj4nlp.net, 2003.
  111. [SIL 03b] SILBERZTEIN M., “Finite-State Recognition of the French determiner system”, Journal of French Language Studies, Cambridge University Press, pp. 221–246, 2003.
  112. [SIL 06] SILBERZTEIN M., “NooJ’s linguistic annotation engine”, in KOEVA S. et al. (ed.), INTEX/NooJ pour le Traitement automatique des langues, pp. 9–26, Presses universitaires de Franche-Comté, 2006.
  113. [SIL 07] SILBERZTEIN M., “An alternative approach to tagging”, in KEDAD Z. et al. (ed.), Proceedings of NLDB 2007, pp. 1–11, LNCS series, Springer-Verlag, 2007.
  114. [SIL 08] SILBERZTEIN M., “Complex annotations with NooJ”, in BLANCO X., SILBERZTEIN M. (ed.), Proceedings of the International NooJ Conference, pp. 214–227, Barcelona, Cambridge Scholars Publishing, Newcastle, 2008.
  115. [SIL 09] SILBERZTEIN M., “Disambiguation tools for NooJ”, in SILBERZTEIN M., VÁRADI T. (eds), Proceedings of the 2008 International NooJ Conference, pp. 158–171, Cambridge Scholars Publishing, Newcastle, 2009.
  116. [SIL 10] SILBERZTEIN M., “Syntactic parsing with NooJ”, in HAMADOU B., SILBERZTEIN M. (eds), Finite-State Language Engineering: NooJ 2009 International Conference and Workshop, Centre for University Publication, Tunisia, 2010.
  117. [SIL 11] SILBERZTEIN M., “Automatic transformational analysis and generation”, Proceedings of the 2010 International Conference and Workshop, pp. 221–231, Greece, 2011.
  118. [SIL 15] SILBERZTEIN M., “The DEM and the LVF dictionaries in NooJ”, in MONTELEONE M., MONTI J., PIA DI BUONO M. et al. (eds), Formalizing Natural Languages with NooJ 2014, Cambridge Scholars Publishing, 2015.
  119. [SLA 07] SLAYDEN G., How to use a Thai dictionary, available at: thai-language.com, 2007.
  120. [SMI 14] SMITH G., Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics, Overlook Hardcover, p. 304, 2014.
  121. [STE 77] STEELE G., “Debunking the ‘Expensive Procedure Call’ Myth or ‘Procedure Call Implementations Considered Harmful’ or, ‘LAMBDA: The Ultimate GOTO’”, Massachusetts Institute of Technology, Cambridge, MA, 1977.
  122. [THO 68] THOMPSON K., “Regular expression search algorithm”, Communications of the ACM, vol. 11, no. 6, pp. 419–422, 1968.
  123. [TOP 01] TOPPING S., The secret life of Unicode: a peek at Unicode’s soft underbelly, available at: www.ibm.com/developerworks/java/library/u-secret.html, 2001.
  124. [TRO 12] TROUILLEUX F., “A new French dictionary for NooJ: le DM”, in VUČKOVIĆ K. et al. (ed.), Selected Papers from the 2011 International NooJ Conference, Cambridge Scholars Publishing, Newcastle, 2012.
  125. [TRO 13] TROUILLEUX F., “A description of the French nucleus VP using cooccurrence constraints”, in DONABÉDIAN A. et al. (ed.), Formalising Natural Languages with NooJ, Selected Papers from the NooJ 2012 International Conference, Cambridge Scholars Publishing, 2013.
  126. [TUR 37] TURING A., “On Computable Numbers, with an Application to the Entscheidungsproblem”, Proc. London Math. Soc., 2nd series, vol. 42, pp. 230–265, 1937.
  127. [VIE 08] VIETRI S., “The formalization of Italian lexicon-grammars tables in a NooJ pair dictionary/grammar”, Proceedings of the International NooJ Conference, Budapest, Cambridge Scholars Publishing, Newcastle, 8–10 June 2008.
  128. [VIE 10] VIETRI S., “Building structural trees for frozen sentences”, Proceedings of the NooJ 2009 International Conference and Workshop, pp. 219–230, Sfax, University Publication Center, 2010.
  129. [VIJ 94] VIJAY SHANKER K., WEIR D., “The equivalence of four extensions of context-free grammars”, Mathematical Systems Theory, vol. 27, no. 6, pp. 511–546, 1994.
  130. [VOL 11] VOLOKH A., NEUMANN G., “Automatic detection and correction of errors in dependency tree-banks”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 2, pp. 346–350, 2011.
  131. [WU 10] WU M., “Integrating a dictionary of psychological verbs into a French-Chinese MT system”, Proceedings of the NooJ 2009 International Conference and Workshop, pp. 315–328, Sfax, University Publication Center, 2010.

Index

  • $THIS special variable
  • +EXCLUDE special operator
  • +ONCE special operator
  • +ONE special operator
  • +UNAMB special operator

A, B

  • abbreviation
  • accent
  • agglutination
  • agreement
  • alphabetical order
  • ambiguity
  • apostrophe
  • ASCII
  • atomic linguistic unit (ALU)
  • binary notation
  • bit
  • byte

C, D

  • characteristic constituent
  • Chinese character
  • Chomsky-Schützenberger hierarchy
  • composite character
  • compound noun
  • conjugation
  • context-free grammar (CFG)
  • context-sensitive grammar (CSG)
  • contextual constraint
  • contraction
  • corpus linguistics
  • dash
  • decimal notation
  • delimiter
  • dependency tree
  • descriptive linguistics
  • Dictionnaire électronique des mots (DEM)
  • digitization
  • discontinuous annotation
  • distributional class

E, F, G, H

  • editorial dictionary
  • elision
  • extensible markup language (XML)
  • finite-state
    • automaton (FSA)
    • transducer (FST)
  • free language
  • generative grammar
  • global variable
  • head-driven phrase structure grammar (HPSG)
  • hexadecimal notation
  • hyphen

I, K, L

  • inheritance
  • INTEX
  • ISO 639
  • Kleene (theorem)
  • Laboratoire d’automatique documentaire et linguistique (LADL)
  • left recursion
  • Les verbes français (LVF)
  • lexical
    • analysis
    • functional grammar (LFG)
    • symbol
  • light verb
  • linguistic formalism
  • local grammar
  • local variable

M, N, P

  • machine translation (MT)
  • middle recursion
  • morpheme
  • morphological analysis
  • natural language processing (NLP)
  • neologism
  • numeral
  • parse tree
  • parser, parsing
  • predicative noun
  • proper name

Q, R, S

  • question answering
  • quote
  • recursion
  • recursive graph
  • recursively enumerable language
  • reduplication
  • regular expression
  • right recursion
  • Roman numeral
  • semantic
    • analysis
    • criterion
    • predicate
  • spelling
  • structural grammar
  • structured annotation
  • syntactic
    • analysis
    • symbol
  • syntax tree

T, U, V, X

  • text annotation structure (TAS)
  • transformational analysis
  • Unicode
  • unrestricted grammar
  • variation
  • verb group
  • Xerox finite-state tool (XFST)

Conclusion

This book describes a research project at the heart of linguistics: formalizing five levels of phenomena linked to the use of written language, namely orthography, vocabulary, morphology, syntax and semantics. This project involves defining the basic elements at each level (letters, ALUs, morphemes, phrases and predicates) as well as a system of rules that enables these elements to be assembled at each linguistic level.

Computers are an ideal tool for this project:

  1. the number of elements of vocabulary is on the order of several hundred thousand,
  2. the variety of phenomena to formalize for vocabulary recognition alone necessitates developing a number of complex procedures and linking them to one another,
  3. many variants and exceptions to most general rules (contraction, elision, capitals, ligatures, inflection, derivation, etc.) need to be taken into account;

  1. regular languages can be described mathematically with the help of regular expressions and finite-state graphs,
  2. context-free languages can be described by context-free grammars and recursive graphs,
  3. context-sensitive languages can be described by context-sensitive grammars,
  4. all recursively enumerable languages can be described by unrestricted grammars;
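
The difference between the first two levels of this hierarchy can be illustrated with a minimal sketch. It is written in Python rather than in NooJ's own notation, and the two toy languages, (ab)* and aⁿbⁿ, as well as the function names, are assumptions chosen purely for illustration. The first language is regular and can be described by a regular expression; the second is context-free but not regular, and is recognized here by a function that mirrors the recursive rule S → a S b | ε.

```python
import re

# Regular language: zero or more repetitions of "ab".
# A regular expression is enough to describe it.
AB_STAR = re.compile(r"(?:ab)*")


def is_ab_star(s: str) -> bool:
    """Return True if s belongs to the regular language (ab)*."""
    return AB_STAR.fullmatch(s) is not None


def is_an_bn(s: str) -> bool:
    """Return True if s belongs to the context-free language a^n b^n.

    The function follows the grammar S -> a S b | (empty): it strips one
    'a' on the left and one 'b' on the right, then calls itself again.
    """
    if s == "":
        return True
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        return is_an_bn(s[1:-1])
    return False
```

The recursive call plays the role that an embedded graph plays in a recursive graph: the rule refers to itself in the middle of the sequence it describes.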

– computers offer linguists the possibility of parsing large quantities of texts. We saw in the third section how to apply formalized linguistic resources (in the form of electronic dictionaries and grammars) in cascade to parse texts from lexical, morphological, syntactic and semantic perspectives, with each intermediary analysis being stored in the text annotation structure (TAS).
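
The idea of a cascade in which every analysis stores its results in a shared annotation structure can be sketched as follows. This is a toy illustration in Python, not NooJ's actual TAS: the class names, the three-word dictionary and the single "determiner + noun" rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character covered
    end: int     # offset just after the last character covered
    label: str   # e.g. "DET", "N", "NP"


@dataclass
class ToyTAS:
    """Hypothetical, simplified stand-in for a text annotation structure."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, start: int, end: int, label: str) -> None:
        self.annotations.append(Annotation(start, end, label))


# Hypothetical three-entry dictionary, for illustration only.
TOY_DICTIONARY = {"the": "DET", "cat": "N", "sleeps": "V"}


def lexical_stage(tas: ToyTAS) -> None:
    """Annotate every word form found in the toy dictionary."""
    pos = 0
    for word in tas.text.split():
        start = tas.text.index(word, pos)
        end = start + len(word)
        label = TOY_DICTIONARY.get(word.lower())
        if label is not None:
            tas.add(start, end, label)
        pos = end


def syntactic_stage(tas: ToyTAS) -> None:
    """Toy local grammar: a DET directly followed by an N forms an NP."""
    units = sorted(
        (a for a in tas.annotations if a.label in ("DET", "N")),
        key=lambda a: a.start,
    )
    for left, right in zip(units, units[1:]):
        only_spaces_between = tas.text[left.end:right.start].strip() == ""
        if left.label == "DET" and right.label == "N" and only_spaces_between:
            tas.add(left.start, right.end, "NP")


def parse(text: str) -> ToyTAS:
    """Apply the stages in cascade; each one reads and extends the same structure."""
    tas = ToyTAS(text)
    for stage in (lexical_stage, syntactic_stage):
        stage(tas)
    return tas
```

Calling `parse("The cat sleeps")` leaves three lexical annotations and one NP annotation in the structure; the syntactic stage never looks at the raw text for word categories, only at what the lexical stage stored.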

Lexical analysis takes care of several problems that can be represented by regular grammars and therefore processed by finite-state graphs. The number of graphs to be built, their size and interactions mean that a full lexical analysis is far from being simple to implement: a complex software program is therefore necessary to identify the atomic linguistic units (ALUs) in texts automatically.

Syntactic analysis as it is traditionally known involves several levels of analyses, using local grammars (which handle small contexts), full generative grammars (which compute sentences’ structure) as well as contextual constraints (which verify various types of ALU compatibility within phrases and sentences). The first two kinds of phenomena can be handled by finite-state graphs and by recursive graphs, whereas agreements are much better handled by context-sensitive grammars.
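
The division of labor between structure and contextual constraints can be sketched in a few lines. The following Python example is only an illustration under stated assumptions: the four-entry lexicon, the feature names and the functions are hypothetical, and the code does not reproduce NooJ's constraint mechanism. It first checks the structure "determiner + noun", then separately verifies that the two units agree in number.

```python
# Hypothetical toy lexicon: each entry carries a category and a number feature.
TOY_LEXICON = {
    "this": {"cat": "DET", "number": "sg"},
    "these": {"cat": "DET", "number": "pl"},
    "cat": {"cat": "N", "number": "sg"},
    "cats": {"cat": "N", "number": "pl"},
}


def is_det_noun(words: list) -> bool:
    """Structural check only: a determiner followed by a noun."""
    if len(words) != 2 or any(w not in TOY_LEXICON for w in words):
        return False
    return [TOY_LEXICON[w]["cat"] for w in words] == ["DET", "N"]


def agrees(words: list) -> bool:
    """Contextual constraint: determiner and noun share the same number."""
    det, noun = (TOY_LEXICON[w] for w in words)
    return det["number"] == noun["number"]


def is_valid_noun_phrase(words: list) -> bool:
    """Accept the phrase only if both the structure and the constraint hold."""
    return is_det_noun(words) and agrees(words)
```

With this separation, "this cats" passes the structural check but is rejected by the agreement constraint, without the structural rule having to be duplicated for singular and plural.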

Formalizing languages requires systematically describing links between elementary sentences and the complex, transformed sentences that actually occur in real-world texts. This description does not require the construction of a specific transformational grammar, as generative grammars used to carry out syntactic analyses already enable transformational parsers and generators to be implemented. Having automatic transformational parsers and generators opens the way to implementing major software applications: linguistic semantic analyzers, question-answering applications, machine translation, etc.

Building a formalized description of vocabulary (in the form of electronic dictionaries) and grammar (in the form of regular, context-free and context-sensitive grammars) is now within our reach, and the effort devoted to such a project would not be colossal, especially compared to the effort made to build the large tagged corpora used by statistical NLP applications.

PART 1
Linguistic Units

In this part I will demonstrate how to define, characterize, and formalize the basic linguistic units that comprise the alphabet (Chapter 2) and vocabulary (Chapter 3) of a language. The description of vocabulary, which has a long linguistic tradition, is generally achieved through the construction of dictionaries. I show in Chapter 4 that traditional linguistic descriptions are not suitable for our project: to formalize languages, we will need to construct dictionaries of a new kind: electronic dictionaries.

PART 2
Languages, Grammars and Machines

We have seen how to formalize the base elements of language: the letters and characters that constitute the alphabet of a language, and the Atomic Linguistic Units (ALUs) that constitute its vocabulary. I have shown in particular that it is necessary to distinguish the concept of an ALU from that of a word form: ALUs are elements of vocabulary (affixes, simple and multiword units, and expressions); they are represented by entries in an electronic dictionary, and are the objects around which phrases and sentences are constructed.

We turn now to the mechanisms used to combine the base linguistic units with one another. Combining letters, prefixes, or suffixes to construct word forms uses morphological rules, while combining ALUs to construct phrases or sentences uses syntactic rules. Note that unifying the types of ALUs has freed us from the arbitrary limits imposed by the use of space delimiters, and so it is all of a language’s rules (both morphological and syntactic) that constitute its grammar. This part of the book is dedicated to the study of grammars.

In his seminal book, [CHO 57] first posed the problem of the formalization of grammars and their adequacy for describing natural languages. The study of formal grammars and of the machines that implement them is part of the standard curriculum of computer science departments in universities today.

In Chapter 5, I will define the base concepts (language, grammar, machine) and then present the four types of grammars that make up the Chomsky-Schützenberger hierarchy. Each of the chapters after that will be dedicated to types of languages/grammars/machines: regular grammars (Chapter 6), context-free grammars (Chapter 7), context-sensitive grammars (Chapter 8), and unrestricted grammars (Chapter 9).