[antlr-interest] Parsing a file

Tue Nov 11 21:59:55 PST 2008

Ok, let me rephrase my previous question.

Let's say that I have a simple grammar for a parser of the following form:

grammar Hour;

time
  :
     'at' HOURNUM ('pm' | 'am' | 'p' | 'a')
  ;

HOURNUM
  :
    // a character range only spans single characters, so 10..12 is spelled out
    '1' '0'..'2'
  | '0'..'9'
  ;

WS
  :
    (' ' | '\t' | '\n')+ {$channel = HIDDEN;}
  ;

I want to use this grammar to extract ALL instances of times in any 
given input sentence.

For example:

1) at 4pm
2) at 8a
3) at 10p
4) at 11am

or even sentences like:

6) for 5 people at 8pm
7) find me 3 pieces at 6pm

All parse fine and return a time/date structure with startIndex and 
endIndex information.  However, the following sentence causes problems:

8) at 2 times the cost at 5pm

Notice that sentences 6 and 7 above contain "5 p" and "3 p", which 
partially match the rule, but the 'at' token is missing from the 
beginning.  Automatic error recovery therefore skips input until it 
finds a token that can start a valid match, in this case "at 8pm" and 
"at 6pm", which begin with the 'at' literal.

The problem with sentence 8 is that "at 2" matches the first two tokens 
of the rule, so ANTLR reports an error and exits.

The behavior I want is this: when the parser finds a partial match like 
this one, it should throw away the partially matched tokens, in this 
case 'at' and '2', and then continue parsing the rest of the input, 
eventually reaching "at 5pm" and matching it correctly.  The default 
behavior, I believe, is that since it finds nothing that matches 
('am' | 'pm' | 'a' | 'p') and instead eventually finds another 'at' and 
'5', it fails and exits.  Is it possible to override this behavior and 
allow some kind of resync of the input, so that the parser throws away 
the 'at' '2' and tries to parse the remaining sentence, that is, "times 
the cost at 5pm"?

What do I need to do to achieve this behavior at the parser level?  I 
believe this is similar to what the lexer does when it tokenizes all the 
input until the end of the input.  I want to do likewise, not at the 
lexical level, but at the syntactic level.  I have seen the example for 
extracting comments in the documentation and wiki, but since my 
domain-specific language does not have a definite termination token 
(such as the "*/" for end of comment), I can't use that approach.  I 
have seen some clues about how to do this by catching the associated 
exceptions and then doing something with the input, but I have not been 
able to figure it out.

Can anyone help with this one?

Thanks in advance,

Yuri Tijerino