∞ /notes/linguistic-dsls-with-antlr | 2007-10-06 | release dsl fdg linguistics nlp

Linguistic DSLs with ANTLR

Cross-posted to: https://fsteeg.wordpress.com/2007/10/06/linguistic-dsls-with-antlr/

As an update to our Functional Grammar project, I've added some experimental ANTLR v3 grammar files for Functional Discourse Grammar (FDG) structures on the Interpersonal Level (IL) and the Representational Level (RL), and updated the project page (with some details in an updated version of our overview paper). For the RL, for example, ANTLR v3 can generate a parser in any of its supported target languages, and that parser will parse linguistic descriptions such as the following serial verb construction in Jamaican Creole (Im tek naif kot mi, 'He cut me with a knife'):
(p1:[
   (Past e1:[
       (f1:tek[
           (x1:im(x1))Ag
           (x2:naif(x2))Inst
       ](f1))
       (f2:kot[
           (x1:im(x1))Ag
           (x3:mi(x3))Pat
       ](f2))
   ](e1))
](p1))
The grammar looks like this (naming is based on FDG terminology):
pcontent   : '(' OPERATOR? 'p' X ( ':' head '(' 'p' X ')' )* ')' FUNCTION? ;
soaffairs  : '(' OPERATOR? 'e' X ( ':' head '(' 'e' X ')' )* ')' FUNCTION? ;
property   : '(' OPERATOR? 'f' X ( ':' head '(' 'f' X ')' )* ')' FUNCTION? ;
individual : '(' OPERATOR? 'x' X ( ':' head '(' 'x' X ')' )* ')' FUNCTION? ;
location   : '(' OPERATOR? 'l' X ( ':' head '(' 'l' X ')' )* ')' FUNCTION? ;
time       : '(' OPERATOR? 't' X ( ':' head '(' 't' X ')' )* ')' FUNCTION? ;
head       : LEMMA? ( '['
              ( soaffairs
              | property
              | individual
              | location
              | time )* ']' ) ? ;
With this one short grammar it is basically possible to parse all valid RL representations, and the same holds for IL structures. I believe such an approach makes sense for two reasons:
  1. It gives linguists a way to define and use a real formal description language for linguistic structures on all linguistic levels, instead of the more common almost-formal description languages, which tend to differ in their details from paper to paper, even within one theory or framework.
  2. In the long run, it might provide a way to represent linguistic knowledge computationally and in detail, in a domain-specific language (DSL) specialized for the descriptive linguist, and thus make detailed linguistic knowledge available for natural language processing.
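To make the RL grammar above a bit more concrete, here is a minimal hand-written recognizer in Python that mirrors its rules. It is only a sketch and not the parser ANTLR would generate. The token rules (OPERATOR, FUNCTION, LEMMA, X) are not shown in this post, so the sketch assumes that operators and functions are capitalized words, lemmas are lowercase words, and X is a run of digits. The names tokenize, Parser and parse are made up for illustration.

```python
import re

# Assumed token shapes (the lexer rules are not shown in the post):
# variable = rule letter + digits (X), OPERATOR/FUNCTION = capitalized word,
# LEMMA = lowercase word.
TOKEN = re.compile(r"\s*([()\[\]:]|[pefxlt]\d+|[A-Za-z]+)")
NESTED = "efxlt"  # rule letters the head rule allows inside [ ... ]


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_cap(tok):
    return tok is not None and tok.isalpha() and tok[0].isupper()


def is_lemma(tok):
    return tok is not None and tok.isalpha() and tok.islower()


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, pattern, what):
        tok = self.peek()
        if tok is None or not re.fullmatch(pattern, tok):
            raise SyntaxError(f"expected {what}, found {tok!r}")
        return self.take()

    def unit(self, letters):
        """'(' OPERATOR? v X ( ':' head '(' v X ')' )* ')' FUNCTION?"""
        self.expect(r"\(", "'('")
        operator = self.take() if is_cap(self.peek()) else None
        var = self.expect(rf"[{letters}]\d+", "variable")
        heads = []
        while self.peek() == ":":
            self.take()
            heads.append(self.head())
            self.expect(r"\(", "'('")
            # Like the grammar, only the letter must match; the index is not checked.
            self.expect(rf"{var[0]}\d+", "closing variable")
            self.expect(r"\)", "')'")
        self.expect(r"\)", "')'")
        function = self.take() if is_cap(self.peek()) else None
        return {"var": var, "operator": operator, "function": function, "heads": heads}

    def head(self):
        """LEMMA? ( '[' unit* ']' )?"""
        lemma = self.take() if is_lemma(self.peek()) else None
        children = []
        if self.peek() == "[":
            self.take()
            while self.peek() == "(":
                children.append(self.unit(NESTED))
            self.expect(r"\]", "']'")
        return {"lemma": lemma, "children": children}


def parse(text, letters="p"):
    """Parse one unit; the default start rule is pcontent (letter 'p')."""
    parser = Parser(tokenize(text))
    tree = parser.unit(letters)
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree


example = """
(p1:[
   (Past e1:[
       (f1:tek[
           (x1:im(x1))Ag
           (x2:naif(x2))Inst
       ](f1))
       (f2:kot[
           (x1:im(x1))Ag
           (x3:mi(x3))Pat
       ](f2))
   ](e1))
](p1))
"""
tree = parse(example)
event = tree["heads"][0]["children"][0]
print(event["operator"], [p["heads"][0]["lemma"] for p in event["heads"][0]["children"]])
# prints: Past ['tek', 'kot']
```

Since the six unit rules differ only in their variable letter, the sketch folds them into a single unit method. With real ANTLR output you would get one method per rule instead, plus proper error reporting and recovery.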
On the practical side, a next step could be a usable tool built on the FDG grammar files that validates an entered structure against the grammar and produces pretty-printed HTML output of the structure (with indices and so on). For integration reasons (e.g. with Tesla), this would make sense as an Eclipse plug-in (and as a small RCP app for simplified use as a validator only). Also, given the interesting developments around Eclipse and the availability of ANTLR grammar files, this might be relatively easy to do, even with useful features like auto-complete and an outline view.
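As a rough idea of what the HTML output could look like, here is a small Python sketch that turns variable indices into subscripts. It covers only the pretty-printing half and assumes the structure has already been validated. The function name to_html and the choice of a pre element with sub tags are my assumptions for illustration, not part of the project.

```python
import html
import re

# A variable is one of the RL rule letters followed by its index digits.
INDEX = re.compile(r"\b([pefxlt])(\d+)\b")


def to_html(structure):
    # Escape &, < and > only; quotes are safe inside element content.
    escaped = html.escape(structure, quote=False)
    # Wrap in <pre> so the indentation of nested units is preserved.
    return "<pre>" + INDEX.sub(r"\1<sub>\2</sub>", escaped) + "</pre>"


print(to_html("(x1:im(x1))Ag"))
# prints: <pre>(x<sub>1</sub>:im(x<sub>1</sub>))Ag</pre>
```

A real tool would more likely render from the parse tree than from the raw text, so that operators and functions could be styled as well.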