[antlr-interest] Lexing almost arbitrary input

Wed Oct 24 04:15:48 PDT 2012

Hi,

Thanks, Jim and George, for your answers.

George, interesting ideas. Unfortunately they're not applicable in my
situation, but I'll keep them in mind for possible future projects.
Jim, the language is not really a design of my own. I will get the  
input from a SQL-dump of the database that is the backend of this  
online dictionary: http://www.pledarigrond.ch/simpel.php
The content of the database will be converted into the structure I'm  
describing with the grammar. The requirement for the grammar is to be  
as simple as possible. Later, speakers of the languages involved, who  
have no programming background, are supposed to submit new entries via  
an interactive editor. This is the reason that we don't want  
syntactical overhead with 'strange' symbols.
I also know that this sounds like a very strange task for ANTLR, but  
my colleagues and boss say we need the fancy error reporting and AST  
creation stuff...

Jim, your suggestion is simple and works. But I really need to include  
the grammatical attributes that belong to the second phrase, like in

phrase1 : phrase2 [attribute]

So I changed your suggestion to

dictEntry:
    lemma=PHRASE SEP translation=PHRASE grammarAtts NEWLINE;

grammarAtts:
    '[' grammarAttList? ']' ;

PHRASE: (~(':'|'['))+ ;
SEP: ':' ;

where grammarAttList ultimately consists of fixed literals such as 'f',
'm' etc. An entry therefore ends in ]\r\n. I now get the following error:

line 1:17 mismatched input 'm]\r\n' expecting ']'

Obviously, ']' is there, so where's the problem?
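
One way to look at the error: the token text in the message is
'm]\r\n', which suggests the lexer never produced a separate ']' token.
The sketch below is a rough Python approximation (not ANTLR itself, and
ANTLR's real prediction differs in detail) of a lexer that always takes
the longest match. The token names LBRACK, RBRACK and ATT, the attribute
values, and the sample entry are made up for illustration; only PHRASE
and SEP mirror the rules above.

```python
import re

# Hypothetical stand-ins for the lexer rules; only PHRASE and SEP come
# from the grammar above, the rest are assumptions for this sketch.
TOKEN_RULES = [
    ("PHRASE", re.compile(r"[^:\[]+")),   # (~(':'|'['))+ : ']' and newlines are allowed
    ("SEP", re.compile(r":")),
    ("LBRACK", re.compile(r"\[")),
    ("RBRACK", re.compile(r"\]")),
    ("ATT", re.compile(r"pl|m|f")),       # assumed attribute literals
    ("NEWLINE", re.compile(r"\r?\n")),
]


def tokenize(text):
    """Split text into (name, lexeme) pairs, longest match first."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier rule wins.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("der Hund : dog [m]\r\n"):
        print(token)
```

On this sample input, everything after '[' ends up in a single PHRASE
token with the text 'm]\r\n', because PHRASE only excludes ':' and '['.
If the real lexer behaves similarly, the parser never sees a ']' token,
and excluding ']', '\r' and '\n' from PHRASE might be one thing to try.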

Thanks again in advance,
Mandy

Quoting Jim Idle <jimi at temporal-wave.com>:

> From your description here, this language cannot be parsed. Is it a
> design of your own, in which case it can be changed, or something you
> are stuck with? You have to have something to disambiguate, such as
>
> def : Phrase Sep Phrase Semi ;
>
> Semi : ';' ;
> Sep : '::' ;
> Phrase : (~(':'|';'))+ ;
>
> But then the problem is so simple that there is no point using a
> grammar. I would just hand code this as a simple character consuming
> loop.
>
> Hope that helps :)
>
> Jim
>
> On Oct 22, 2012, at 19:00, "mandy at think-a-lot.de"  
> <mandy at think-a-lot.de> wrote:
>
>> Dear list,
>>
>> in a project we want to use ANTLR to parse lexicon/dictionary entries.
>> I'm the one who has the honour of writing the grammar(s) for that.
>> I'm currently stuck with the lexer part.
>> Here's the problem:
>>
>> Since we talk about dictionary entries, the structure is quite simple:
>> You have a word in language1 (lemma), a translation in language2 and
>> some grammatical attributes. The latter is somewhat fixed, having a
>> limited set of values like 'm', 'f', 'pl' and so on.
>> The problem is the former. The lemma (and translation) could be a
>> simple word like "dog", but it can also be several words with spaces
>> (phrases) like in "come to be known"; furthermore it could contain
>> non-letter characters like '-' ("push-up"), '(' ("Rheinländer(in)"),
>> even numbers, slashes, percent signs etc. may be part of the lemma
>> (e.g. "100% (bio-)degradable").
>>
>> So there are just too many possibilities - I did not get very far with
>> the 'a'..'z' approach (all the more because we are talking about
>> languages with umlauts and accents). And I really did not want to list
>> all possible combinations; I think it would be a pain...
>>
>> I thought about something like "consume just everything until some
>> special character (that will never be part of the lemma)". First rules
>> I tried were
>>
>> LEMMA: (options {greedy=false;}: .)+ ~COLON;
>> TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);
>>
>> but this didn't seem to work ("required (...)+ loop did not match
>> anything at character ..." for each input character). So I used just
>>
>> LEMMA: (~COLON)+;
>> TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;
>>
>> but now I don't see any output - neither from my code actions nor the
>> AST. So I'm not sure if it even works; plus I think this is not the
>> very best way to handle the problem.
>>
>> Any ideas?
>>
>> Mandy
>>
>> P.S.: The structure for the dictionary entry has to be as simple as this:
>>
>> dictionary:
>>    dictEntry*  EOF
>> ;
>>
>> dictEntry
>> :
>>    LEMMA
>>
>>    COLON
>>
>>    TRANSLATION
>>
>>    grammarAtts //which is '[' list_of_attributes ']'
>>
>>    NEWLINE //my instructor wants an entry to end with a newline, not something like ';' ...
>> ;
>>
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe:  
>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address