[antlr-interest] Lexing almost arbitrary input

Mon Oct 22 05:04:41 PDT 2012

From your description here, this language cannot be parsed. Is it a
design of your own, in which case it can be changed, or something you
are stuck with? You have to have something to disambiguate, such as:

def : Phrase Sep Phrase Semi ;

Semi : ';' ;
Sep : '::' ;
Phrase : (~(':'|';'))+ ;

But then the problem is so simple that there is no point in using a
grammar. I would just hand-code this as a simple character-consuming
loop.
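
A minimal sketch of what such a hand-coded loop might look like, written
in Python purely for illustration. It assumes the 'Phrase :: Phrase ;'
layout from the rules above. The function name parse_entries, the
stripping of surrounding whitespace, and the sample translations are all
assumptions for the example and not part of the original question.

```python
def parse_entries(text):
    """Split 'phrase :: phrase ;' input into (lemma, translation) pairs."""
    entries = []
    lemma = None   # first phrase of the current entry, set once '::' is seen
    buf = []       # characters of the phrase currently being read
    i = 0
    while i < len(text):
        ch = text[i]
        if text.startswith("::", i):
            # Separator: everything buffered so far is the lemma.
            if lemma is not None:
                raise ValueError(f"unexpected '::' at offset {i}")
            lemma = "".join(buf).strip()
            buf = []
            i += 2
        elif ch == ";":
            # Terminator: everything buffered so far is the translation.
            if lemma is None:
                raise ValueError(f"missing '::' before ';' at offset {i}")
            entries.append((lemma, "".join(buf).strip()))
            lemma, buf = None, []
            i += 1
        elif ch == ":":
            # A lone ':' is excluded from Phrase, as in the lexer rule above.
            raise ValueError(f"stray ':' at offset {i}")
        else:
            # Any other character, including umlauts, digits, '%', '(' ...
            buf.append(ch)
            i += 1
    if lemma is not None or "".join(buf).strip():
        raise ValueError("unterminated entry at end of input")
    return entries


# Illustrative input only; the translations are made-up sample data.
sample = "dog :: Hund ;\npush-up :: Liegestütz ;\n"
for lemma, translation in parse_entries(sample):
    print(lemma, "->", translation)
```

Unlike the Phrase rule, this sketch does not reject empty phrases; a
check for that could be added where the lemma and translation are taken
from the buffer.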

Hope that helps :)

Jim

On Oct 22, 2012, at 19:00, "mandy at think-a-lot.de" <mandy at think-a-lot.de> wrote:

> Dear list,
>
> in a project we want to use ANTLR to parse lexicon/dictionary entries.
> I'm the one who has the honour of writing the grammar(s) for that.
> I'm currently stuck with the lexer part.
> Here's the problem:
>
> Since we talk about dictionary entries, the structure is quite simple:
> You have a word in language1 (lemma), a translation in language2 and
> some grammatical attributes. The latter is somewhat fixed, having a
> limited set of values like 'm', 'f', 'pl' and so on.
> The problem is the former. The lemma (and translation) could be a
> simple word like "dog", but it can also be several words with spaces
> (phrases) like in "come to be known"; furthermore it could contain
> non-letter characters like '-' ("push-up"), '(' ("Rheinländer(in)"),
> even numbers, slashes, percent signs etc. may be part of the lemma
> (e.g. "100% (bio-)degradable").
>
> So there are just too many possibilities - I did not get very far with
> the 'a'..'z' approach (all the more so because we are talking about
> languages with umlauts and accents). And I really did not want to list
> all possible combinations; I think it would be a pain...
>
> I thought about something like "consume just everything until some
> special character (that will never be part of the lemma)". First rules
> I tried were
>
> LEMMA: (options {greedy=false;}: .)+ ~COLON;
> TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);
>
> but this didn't seem to work ("required (...)+ loop did not match
> anything at character ..." for each input character). So I used just
>
> LEMMA: (~COLON)+;
> TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;
>
> but now I don't see any output - neither from my code actions nor the
> AST. So I'm not sure if it even works; plus I think this is not the
> very best way to handle the problem.
>
> Any ideas?
>
> Mandy
>
> P.S.: The structure for the dictionary entry has to be as simple as this:
>
> dictionary:
>    dictEntry*  EOF
> ;
>
> dictEntry
> :
>    LEMMA
>
>    COLON
>
>    TRANSLATION
>
>    grammarAtts //which is '[' list_of_attributes ']'
>
>    NEWLINE //my instructor wants an entry to end with a newline, not
>            //something like ';' ...
> ;
>
>