[antlr-interest] Lexing almost arbitrary input
mandy at think-a-lot.de
Mon Oct 22 04:00:35 PDT 2012
Dear list,
In a project, we want to use ANTLR to parse lexicon/dictionary entries.
I'm the one who has the honour of writing the grammar(s) for that.
I'm currently stuck with the lexer part.
Here's the problem:
Since we are talking about dictionary entries, the structure is quite
simple: you have a word in language1 (the lemma), a translation in
language2, and some grammatical attributes. The attributes are fairly
fixed, with a limited set of values like 'm', 'f', 'pl' and so on.
The problem is the lemma and the translation. The lemma (and likewise
the translation) could be a simple word like "dog", but it can also
consist of several words separated by spaces (a phrase), as in "come to
be known". Furthermore, it could contain non-letter characters like '-'
("push-up") or '(' ("Rheinländer(in)"); even digits, slashes, percent
signs etc. may be part of the lemma (e.g. "100% (bio-)degradable").
So there are just too many possibilities. I did not get very far with
the 'a'..'z' approach (all the more so because we are dealing with
languages that have umlauts and accents). And I really do not want to
list all possible combinations; I think it would be a pain.
I thought about something like "just consume everything up to some
special character (one that will never be part of the lemma)". The
first rules I tried were
LEMMA: (options {greedy=false;}: .)+ ~COLON;
TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);
but this didn't seem to work ("required (...)+ loop did not match
anything at character ..." for each input character). So I used just
LEMMA: (~COLON)+;
TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;
but now I don't see any output - neither from my code actions nor the
AST. So I'm not sure if it even works; plus I think this is not the
very best way to handle the problem.
Any ideas?
Mandy
P.S.: The structure for the dictionary entry has to be as simple as this:
dictionary
    :   dictEntry* EOF
    ;
dictEntry
    :   LEMMA
        COLON
        TRANSLATION
        grammarAtts   // which is '[' list_of_attributes ']'
        NEWLINE       // my instructor wants an entry to end with a newline, not sth like ';' ...
    ;
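
To make the intended tokenisation concrete, here is a minimal sketch of the "consume everything up to a delimiter" idea, written in plain Python rather than ANTLR. It is only an illustration under assumptions: that ':', '[' and ']' never occur inside a lemma or translation, and that the attributes inside the brackets are comma-separated. The names ENTRY and parse_entry, and the sample lines in the usage note, are made up for this example.

```python
import re

# Assumed layout of one entry line: lemma ':' translation '[' attributes ']'
# Assumption: ':', '[' and ']' are never part of a lemma or translation.
ENTRY = re.compile(
    r"(?P<lemma>[^:\[\]\r\n]+):"           # everything up to the first ':'
    r"(?P<translation>[^:\[\]\r\n]+)"      # everything up to the '['
    r"\[(?P<atts>[^\[\]\r\n]*)\]\s*$"      # bracketed attributes, then end of line
)


def parse_entry(line):
    """Split one entry line into (lemma, translation, attributes), or return None."""
    m = ENTRY.match(line)
    if m is None:
        return None
    # Assumption: attributes are comma-separated, e.g. "m, f".
    atts = [a.strip() for a in m.group("atts").split(",") if a.strip()]
    return m.group("lemma").strip(), m.group("translation").strip(), atts
```

With made-up sample lines, parse_entry("Rheinländer(in): Rhinelander [m, f]\n") yields the lemma "Rheinländer(in)", the translation "Rhinelander" and the attributes 'm' and 'f', while a line without a colon yields None. Note that the lemma and translation share one character class here; which role a piece of text plays is decided only by its position relative to the delimiters.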