[antlr-interest] Lexing almost arbitrary input

Jim Idle jimi at temporal-wave.com
Mon Oct 29 01:05:33 PDT 2012


Looking at your input again, all your examples surround the words with
double quotes (""); you do this naturally to show where they begin and end
in English. If you do the same in your language, this becomes a trivial
issue. Otherwise I think you will have to use code in the lexer anyway.
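
As a rough illustration of why quoting makes the problem trivial, here is a
minimal sketch in plain Python (not ANTLR). The entry layout
"lemma" : "translation" [atts], the names ENTRY and parse_quoted_entry, and
the sample translations are assumptions made up for this example; they are
not taken from the original post.

```python
import re

# Hypothetical entry layout: "lemma" : "translation" [attr, attr]
# Inside the quotes anything except a double quote is allowed, so spaces,
# digits, '%', '(', '-', umlauts and even ':' need no special handling.
ENTRY = re.compile(
    r'"(?P<lemma>[^"]*)"\s*:\s*"(?P<translation>[^"]*)"\s*\[(?P<atts>[^\]]*)\]'
)


def parse_quoted_entry(line):
    """Return (lemma, translation, attributes), or None if the line does not match."""
    match = ENTRY.fullmatch(line.strip())
    if match is None:
        return None
    atts = [a.strip() for a in match.group("atts").split(",") if a.strip()]
    return match.group("lemma"), match.group("translation"), atts


# Sample data is illustrative only.
print(parse_quoted_entry('"100% (bio-)degradable" : "100% (biologisch) abbaubar" [adj]'))
```

The same idea carries over to a lexer: a quoted-string token only has to
exclude the quote character itself, so it cannot clash with the colon or
the attribute list.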

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of
mandy at think-a-lot.de
Sent: Monday, October 22, 2012 7:01 PM
To: antlr-interest
Subject: [antlr-interest] Lexing almost arbitrary input

Dear list,

in a project we want to use ANTLR to parse lexicon/dictionary entries.
I'm the one who has the honour of writing the grammar(s) for that.
I'm currently stuck with the lexer part.
Here's the problem:

Since we are talking about dictionary entries, the structure is quite simple:
you have a word in language1 (the lemma), a translation in language2, and
some grammatical attributes. The attributes are more or less fixed, with a
limited set of values like 'm', 'f', 'pl' and so on.
The problem is the lemma and the translation. A lemma could be a simple
word like "dog", but it can also be several words separated by spaces
(a phrase), as in "come to be known". It can also contain non-letter
characters like '-' ("push-up") or '(' ("Rheinländer(in)"); even numbers,
slashes, percent signs etc. may be part of the lemma (e.g. "100%
(bio-)degradable").

So there are just too many possibilities. I did not get very far with the
'a'..'z' approach (all the more because we are talking about languages with
umlauts and accents). And I really did not want to list all possible
combinations; I think that would be a pain...

I thought about something like "consume just everything until some special
character (that will never be part of the lemma)". First rules I tried
were

LEMMA: (options {greedy=false;}: .)+ ~COLON;
TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);

but this didn't seem to work ("required (...)+ loop did not match anything
at character ..." for each input character). So I used just

LEMMA: (~COLON)+;
TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;

but now I don't see any output - neither from my code actions nor the AST.
So I'm not sure if it even works; plus I think this is not the very best
way to handle the problem.
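
The "consume everything until some special character" idea can be sketched
outside ANTLR as a small hand-written scanner that changes its stop
character depending on where it is in the entry. One likely difficulty with
the two negated-set rules above is that an ANTLR lexer chooses tokens
without knowing what the parser expects, so two rules that both match
almost arbitrary text compete for the same input. The Python sketch below
sidesteps that by tracking position explicitly. It assumes, beyond what the
post states, that ':' never occurs in a lemma, that '[' never occurs in a
translation, and that attributes are comma-separated; read_until and
scan_entry are hypothetical names, and the sample translations are made up.

```python
def read_until(text, pos, stop_chars):
    """Consume characters from pos until one of stop_chars or the end of text."""
    start = pos
    while pos < len(text) and text[pos] not in stop_chars:
        pos += 1
    return text[start:pos], pos


def scan_entry(line):
    """Split one entry of the form 'lemma: translation [atts]' into tokens."""
    tokens = []

    # Lemma: everything up to the first colon (assumed never part of a lemma).
    lemma, pos = read_until(line, 0, ":\r\n")
    if pos >= len(line) or line[pos] != ":":
        raise ValueError("expected ':' after lemma")
    tokens.append(("LEMMA", lemma.strip()))
    tokens.append(("COLON", ":"))

    # Translation: everything up to '[' or the end of the line.
    translation, pos = read_until(line, pos + 1, "[\r\n")
    tokens.append(("TRANSLATION", translation.strip()))

    # Optional attribute list in square brackets.
    if pos < len(line) and line[pos] == "[":
        atts, pos = read_until(line, pos + 1, "]\r\n")
        if pos >= len(line) or line[pos] != "]":
            raise ValueError("unterminated attribute list")
        tokens.append(("ATTS", [a.strip() for a in atts.split(",") if a.strip()]))
    return tokens


# Sample data is illustrative only.
for token in scan_entry("100% (bio-)degradable: 100% (biologisch) abbaubar [adj]\n"):
    print(token)
```

Inside an ANTLR lexer the same effect would need some form of state or
predicate code, since the stop character differs between the lemma and the
translation.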

Any ideas?

Mandy

P.S.: The structure for the dictionary entry has to be as simple as this:

dictionary:
	dictEntry*  EOF
;

dictEntry
:
	LEMMA

	COLON

	TRANSLATION

	grammarAtts // which is '[' list_of_attributes ']'

	NEWLINE // my instructor wants an entry to end with a newline,
	        // not something like ';' ...
;

