[antlr-interest] Lexer consumes input but doesn't emit all tokens

Андрей Асеев andron-eiu at mail.ru
Wed Aug 8 03:03:03 PDT 2012


Hello, Glenn.

After a closer look at your problem, I found the reason.
ANTLR's algorithm matches loops greedily.
The trailing ':' is consumed by the inner ( ... )* loop, which must then be
followed by an ALPHA_NUM, and the lexer does not backtrack when it hits an
unexpected EOF instead.

I racked my brain for an hour trying to help you...
The only way I came up with is this:

Lex NAME_LITERAL without the ':' character:

NAME_LITERAL
        :    '\\'? ALPHA_NUM ( ( '_' | '-' | ALPHA_NUM )* ALPHA_NUM )? ;


Then post-process the lexer's token stream in your target language,
rewriting the token sequence
<NAME_LITERAL>a, <COLON>col, <NAME_LITERAL>b
to
<NAME_LITERAL>(a+col+b)
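
A minimal sketch of that merge step, written in Python purely for illustration. It does not use the ANTLR runtime: the Tok tuple, its start field, and the merge_names helper are all hypothetical stand-ins for whatever your target language's token objects provide. It assumes each token carries its start offset, so that 'test:ack' is merged but 'test : ack' is not.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a lexer token (not an ANTLR class)."""
    type: str   # e.g. "NAME_LITERAL" or "COLON"
    text: str
    start: int  # offset of the token's first character in the input


def _touches(left: Tok, right: Tok) -> bool:
    # True when 'right' begins exactly where 'left' ends (no whitespace between).
    return right.start == left.start + len(left.text)


def merge_names(tokens: List[Tok]) -> List[Tok]:
    """Fold adjacent NAME_LITERAL COLON NAME_LITERAL runs into one NAME_LITERAL."""
    out: List[Tok] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.type == "NAME_LITERAL":
            # Keep absorbing ':' + name pairs while all three tokens touch.
            while (i + 2 < len(tokens)
                   and tokens[i + 1].type == "COLON"
                   and tokens[i + 2].type == "NAME_LITERAL"
                   and _touches(tok, tokens[i + 1])
                   and _touches(tokens[i + 1], tokens[i + 2])):
                tok = Tok("NAME_LITERAL",
                          tok.text + tokens[i + 1].text + tokens[i + 2].text,
                          tok.start)
                i += 2
        out.append(tok)
        i += 1
    return out
```

With the input test:ack: lexed as NAME_LITERAL, COLON, NAME_LITERAL, COLON, this leaves a NAME_LITERAL 'test:ack' followed by the final COLON. Note that the sketch does not merge names joined by two or more consecutive colons, which the original rule accepted.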

Or, the other way round, allow ':' at the end of the token:

NAME_LITERAL
        :    '\\'? ALPHA_NUM ( ':' | '_' | '-' | ALPHA_NUM )*;

Then post-process the lexer's token stream and manually split any trailing
special characters into separate tokens, changing
<NAME_LITERAL>"abc:"
to
<NAME_LITERAL>"abc", <COLON>":"
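
A minimal sketch of that split step, again in Python for illustration only and without the ANTLR runtime. The Tok tuple and split_trailing helper are hypothetical. One assumption to check against your full grammar: a trailing ':' becomes COLON, while a trailing '_' or '-' is emitted as ANY, since the posted fragment defines no dedicated tokens for them.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a lexer token (not an ANTLR class)."""
    type: str
    text: str


# Assumption: only ':' has its own token type; other trailing chars map to ANY.
SPECIAL_TYPES = {":": "COLON"}
TRAILING_CHARS = ":_-"


def split_trailing(tok: Tok) -> List[Tok]:
    """Split trailing ':', '_' and '-' off a NAME_LITERAL into separate tokens."""
    if tok.type != "NAME_LITERAL":
        return [tok]
    end = len(tok.text)
    # Walk back over trailing special characters; never strip the first char.
    while end > 1 and tok.text[end - 1] in TRAILING_CHARS:
        end -= 1
    result = [Tok("NAME_LITERAL", tok.text[:end])]
    for ch in tok.text[end:]:
        result.append(Tok(SPECIAL_TYPES.get(ch, "ANY"), ch))
    return result


def split_stream(tokens: List[Tok]) -> List[Tok]:
    """Apply split_trailing to every token of a stream, preserving order."""
    out: List[Tok] = []
    for tok in tokens:
        out.extend(split_trailing(tok))
    return out
```

Applied to a single NAME_LITERAL 'test:ack:', this gives a NAME_LITERAL 'test:ack' followed by a COLON, which is the token sequence Glenn asked for.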


08.08.2012 3:39, Glenn McGregor wrote:
> On 8/7/2012 3:05 PM, Андрей Асеев wrote:
>> That could happen, for example, if you chose the wrong rule in the
>> ANTLRWorks interpreter's rule box. :)
>>
>> Or show the grammar rule you use to parse the test input.
> If I change the input to 'test:ack :', it parses just fine and returns
> the appropriate tokens.
>
>
> The grammar is about 500 lines long, and I tried to show just the
> hopefully relevant entries.
>
> But the rule in ANTLRWorks starts at my
>
> start_program
>       :    program EOF! ;
>
>
> I can post my grammar somewhere if it becomes necessary to pursue this.
>
>
> The output of the interpreter shows  (in bad ascii art)
>
> <grammar Tal>
>       start_program
>           program
>               <epsilon>
>           t
>
> with the altered input, I get
>
> <grammar Tal>
>       start_program
>           program
>               statement
>                   label_statement
>                       string_literal
>                           test:ack
>                       :
> <EOF>
>
> Thanks
>
> Glenn
>
>>> Given the partial grammar from a much larger...
>>>
>>>
>>> tokens { COLON = ':' }
>>>
>>> fragment
>>> ALPHA_NUM
>>>         :    'A'..'Z' | 'a'..'z' | '0'..'9';
>>>
>>> NAME_LITERAL
>>>         :    '\\'? ALPHA_NUM ( ( ':' | '_' | '-' | ALPHA_NUM )* ALPHA_NUM )? ;
>>>
>>> ANY    :    . ;
>>>
>>>
>>>
>>> I would like the input
>>>
>>> test:ack:
>>>
>>> to arrive as two tokens, a NAME_LITERAL of 'test:ack', and a COLON.
>>>
>>> Instead, this input disappears entirely, but parses successfully.
>>>
>>> Any suggestions?
>>>
>>> Glenn McGregor
>


