[antlr-interest] Lexing almost arbitrary input

mandy at think-a-lot.de mandy at think-a-lot.de
Mon Oct 22 04:00:35 PDT 2012


Dear list,

In a project we want to use ANTLR to parse lexicon/dictionary entries.
I have the honour of writing the grammar(s) for that,
and I'm currently stuck on the lexer part.
Here's the problem:

Since we are talking about dictionary entries, the structure is quite
simple: you have a word in language1 (the lemma), a translation in
language2 and some grammatical attributes. The attributes are fairly
fixed, with a limited set of values like 'm', 'f', 'pl' and so on.
The problem is the lemma (and the translation). It could be a simple
word like "dog", but it can also be several words separated by spaces
(a phrase), as in "come to be known". Furthermore, it can contain
non-letter characters like '-' ("push-up") or '(' ("Rheinländer(in)");
even digits, slashes, percent signs etc. may be part of the lemma
(e.g. "100% (bio-)degradable").

So there are just too many possibilities. I did not get very far with
the 'a'..'z' approach (all the more because we are dealing with
languages that have umlauts and accents). And I really do not want to
list all possible combinations; I think that would be a pain...

I thought about something like "consume just everything until some
special character (that will never be part of the lemma)". The first
rules I tried were

LEMMA: (options {greedy=false;}: .)+ ~COLON;
TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);

but this didn't seem to work ("required (...)+ loop did not match  
anything at character ..." for each input character). So I used just

LEMMA: (~COLON)+;
TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;

but now I don't see any output, neither from my code actions nor from
the AST. So I'm not sure it even works; besides, I doubt this is the
best way to handle the problem.

Any ideas?

Mandy

P.S.: The structure for the dictionary entry has to be as simple as this:

dictionary:
	dictEntry*  EOF
;

dictEntry
:
	LEMMA

	COLON

	TRANSLATION

	grammarAtts //which is '[' list_of_attributes ']'

	NEWLINE //my instructor wants an entry to end with a newline, not something like ';'
;
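
To make the intended "consume everything up to a delimiter" behaviour concrete, here is a minimal Python sketch of the entry layout from the dictEntry rule above. It is not an ANTLR grammar and not a proposed fix, just a reference for what each piece should capture. The names ENTRY, parse_entry and parse_dictionary are hypothetical. It assumes that ':' never occurs in a lemma, that '[' never occurs in a translation, and that attributes are separated by commas; none of this is fixed by the description above.

```python
import re

# Layout taken from the dictEntry rule: LEMMA ':' TRANSLATION '[' atts ']' NEWLINE
# Assumptions (not given in the post): no ':' inside a lemma, no '[' inside a
# translation, attributes separated by commas.
ENTRY = re.compile(
    r"""
    (?P<lemma>[^:\r\n]+)          # everything up to the first colon
    :
    (?P<translation>[^\[\r\n]+)   # everything up to the opening bracket
    \[(?P<atts>[^\]\r\n]*)\]      # attribute list between the brackets
    """,
    re.VERBOSE,
)


def parse_entry(line):
    """Split one dictionary line into (lemma, translation, attributes)."""
    match = ENTRY.fullmatch(line.strip())
    if match is None:
        raise ValueError("not a dictionary entry: %r" % line)
    atts = [a.strip() for a in match.group("atts").split(",") if a.strip()]
    return match.group("lemma").strip(), match.group("translation").strip(), atts


def parse_dictionary(text):
    """Parse every non-empty line; each entry ends at a newline."""
    return [parse_entry(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "dog: Hund [m]\n100% (bio-)degradable: 100% (biologisch) abbaubar [adj]\n"
    for entry in parse_dictionary(sample):
        print(entry)
```

The point of the sketch is that each field is defined by the characters it must not contain rather than by a list of allowed letters, so umlauts, digits, parentheses and percent signs need no special handling.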


