[antlr-interest] Lexing almost arbitrary input

Wed Oct 24 07:59:29 PDT 2012

Arbitrary text is very hard, but structured text (HTML or RTF with title fields preceding the text) is quite feasible, although the grammars get a bit messy: you need rules that parse the structure of interest, plus other, very similar rules that parse "random" phrases which have similar structure but should be ignored.  Getting users to enter text in a specific format with a word processor is easy, and any decent word processor lets you define forms for data entry.  I took this approach for parsing the documentation for the Cassini spacecraft command language a few years ago, and it worked well, although the grammar was necessarily riddled with syntactic predicates.

From a user perspective, the data entry "grammar" can be made as simple as possible, but the actual grammar will not be.

--Loring

>________________________________
>From: "mandy at think-a-lot.de" <mandy at think-a-lot.de>
>To: antlr-interest <antlr-interest at antlr.org> 
>Sent: Wednesday, October 24, 2012 4:15 AM
>Subject: Re: [antlr-interest] Lexing almost arbitrary input
> 
>Hi,
>
>thanks Jim and George for your answers.
>
>George, interesting ideas. Unfortunately they're not applicable in my  
>situation. But very interesting. I'll keep it in mind for possible  
>future projects.
>Jim, the language is not really a design of my own. I will get the  
>input from a SQL-dump of the database that is the backend of this  
>online dictionary: http://www.pledarigrond.ch/simpel.php
>The content of the database will be converted into the structure I'm  
>describing with the grammar. The requirement for the grammar is to be  
>as simple as possible. Later, speakers of the languages involved, who  
>have no programming background, are supposed to submit new entries via  
>an interactive editor. This is the reason that we don't want  
>syntactical overhead with 'strange' symbols.
>I also know that this sounds like a very strange task for ANTLR, but  
>my colleagues and boss say we need the fancy error reporting and AST  
>creation stuff...
>
>Jim, your suggestion is simple and works. But I really need to include  
>the grammatical attributes that belong to the second phrase, like in
>
>phrase1 : phrase2 [attribute]
>
>So I changed your suggestion to
>
>dictEntry:
>    lemma=PHRASE SEP translation=PHRASE grammarAtts NEWLINE;
>
>grammarAtts:
>       '[' grammarAttList? ']' ;
>
>PHRASE:    (~(':'|'['))+ ;
>SEP: ':' ;
>
>where grammarAttList ultimately contains fixed literals such as 'f', 'm', etc.
>So an entry ends in ]\r\n. I now get the following error:
>
>line 1:17 mismatched input 'm]\r\n' expecting ']'
>
>Obviously, ']' is there, so where's the problem?
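
A likely explanation, sketched in Python rather than ANTLR so that it can be run directly: PHRASE excludes only ':' and '[', so once the lexer is past the '[' nothing stops it, and 'm]\r\n' comes out as one PHRASE token. The parser never sees a separate ']'. The regular expressions below only model the character sets; they are not ANTLR's lexer algorithm. The names PHRASE_ORIGINAL, PHRASE_FIXED and tokenize, the extended exclusion set, and the sample entry are all assumptions made for this illustration.

```python
import re

# Model of the rule from the message: PHRASE : (~(':'|'['))+ ;
PHRASE_ORIGINAL = r"[^:\[]+"
# Assumed fix (not from the thread): also stop at ']' and at line ends.
PHRASE_FIXED = r"[^:\[\]\r\n]+"


def tokenize(text, phrase_pattern):
    """Split text into (token_type, token_text) pairs.

    PHRASE is tried first and consumes as many characters as its
    character set allows, which is the behaviour being illustrated.
    """
    token_re = re.compile(
        "(?P<PHRASE>" + phrase_pattern + ")"
        r"|(?P<SEP>:)|(?P<LBRACK>\[)|(?P<RBRACK>\])|(?P<NEWLINE>\r?\n)"
    )
    tokens = []
    pos = 0
    while pos < len(text):
        match = token_re.match(text, pos)
        if match is None:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


entry = "dog : Hund [m]\r\n"  # invented sample entry
for kind, value in tokenize(entry, PHRASE_ORIGINAL):
    print(kind, repr(value))
print("---")
for kind, value in tokenize(entry, PHRASE_FIXED):
    print(kind, repr(value))
```

With the original set the last token is a PHRASE containing 'm]\r\n'; with the extended set ']' and the line end come out as tokens of their own. In the real grammar the attribute text would still arrive as PHRASE, so the literals 'f', 'm' and so on would probably need separate handling.
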
>
>Thanks again in advance,
>Mandy
>
>Quoting Jim Idle <jimi at temporal-wave.com>:
>
>> From your description here, this language cannot be parsed. Is it a
>> design of your own, in which case it can be changed, or something you
>> are stuck with? You have to have something to disambiguate, such as
>>
>> def : Phrase Sep Phrase Semi ;
>>
>> Semi : ';' ;
>> Sep : '::' ;
>> Phrase : (~(':'|';'))+ ;
>>
>> But then the problem is so simple that there is no point using a
>> grammar. I would just hand code this as a simple character consuming
>> loop.
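
For comparison, a hand-coded loop of the kind mentioned above might look roughly like the Python sketch below. It assumes the entry layout discussed in this thread, "lemma : translation [attributes]" with one entry per line. The function name parse_entry, the comma as attribute separator, and the sample entry are assumptions made for this illustration, not something taken from the project.

```python
def parse_entry(line):
    """Parse 'lemma : translation [att, att]' one character at a time."""
    lemma, translation, attributes = [], [], []
    state = "lemma"
    for ch in line.rstrip("\r\n"):
        if state == "lemma":
            if ch == ":":
                state = "translation"  # separator reached
            else:
                lemma.append(ch)
        elif state == "translation":
            if ch == "[":
                state = "attributes"  # attribute list starts
            else:
                translation.append(ch)
        elif state == "attributes":
            if ch == "]":
                state = "done"  # attribute list closed
            else:
                attributes.append(ch)
        elif not ch.isspace():
            # state == "done": only whitespace may follow the ']'
            raise ValueError(f"unexpected text after ']': {ch!r}")
    if state != "done":
        raise ValueError(f"incomplete entry, stopped while reading {state}")
    # Assumption: attributes are separated by commas.
    atts = [a.strip() for a in "".join(attributes).split(",") if a.strip()]
    return "".join(lemma).strip(), "".join(translation).strip(), atts


print(parse_entry("Rheinländer(in) : Rhinelander [m, f]\r\n"))
```

Because the loop only looks for ':', '[' and ']', umlauts, digits, slashes and parentheses in the lemma need no special treatment. What it does not provide is the error reporting and AST construction that motivated using ANTLR in the first place.
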
>>
>> Hope that helps :)
>>
>> Jim
>>
>> On Oct 22, 2012, at 19:00, "mandy at think-a-lot.de"  
>> <mandy at think-a-lot.de> wrote:
>>
>>> Dear list,
>>>
>>> in a project we want to use ANTLR to parse lexicon/dictionary entries.
>>> I'm the one who has the honour of writing the grammar(s) for that.
>>> I'm currently stuck with the lexer part.
>>> Here's the problem:
>>>
>>> Since we talk about dictionary entries, the structure is quite simple:
>>> You have a word in language1 (lemma), a translation in language2 and
>>> some grammatical attributes. The latter is somewhat fixed, having a
>>> limited set of values like 'm', 'f', 'pl' and so on.
>>> The problem is the former. The lemma (and translation) could be a
>>> simple word like "dog", but it can also be several words with spaces
>>> (phrases) like in "come to be known"; furthermore it could contain
>>> non-letter characters like '-' ("push-up"), '(' ("Rheinländer(in)"),
>>> even numbers, slashes, percent signs etc. may be part of the lemma
>>> (e.g. "100% (bio-)degradable").
>>>
>>> So there are just too many possibilities - I did not get very far with
>>> the 'a'..'z' approach (all the more so because we are talking about
>>> languages with umlauts and accents). And I really did not want to list
>>> all possible combinations; I think it would be a pain...
>>>
>>> I thought about something like "consume just everything until some
>>> special character (that will never be part of the lemma)". First rules
>>> I tried were
>>>
>>> LEMMA: (options {greedy=false;}: .)+ ~COLON;
>>> TRANSLATION: (options {greedy=false;}: .)+ ~(CARRIAGERETURN|LINEFEED);
>>>
>>> but this didn't seem to work ("required (...)+ loop did not match
>>> anything at character ..." for each input character). So I used just
>>>
>>> LEMMA: (~COLON)+;
>>> TRANSLATION: (~(CARRIAGERETURN|LINEFEED))+;
>>>
>>> but now I don't see any output - neither from my code actions nor the
>>> AST. So I'm not sure if it even works; plus I think this is not the
>>> very best way to handle the problem.
>>>
>>> Any ideas?
>>>
>>> Mandy
>>>
>>> P.S.: The structure for the dictionary entry has to be as simple as this:
>>>
>>> dictionary:
>>>    dictEntry*  EOF
>>> ;
>>>
>>> dictEntry
>>> :
>>>    LEMMA
>>>
>>>    COLON
>>>
>>>    TRANSLATION
>>>
>>>    grammarAtts // which is '[' list_of_attributes ']'
>>>
>>>    NEWLINE // my instructor wants an entry to end with a newline, not something like ';'
>>> ;
>>>
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe:  
>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address