[antlr-interest] Parsing a file
Yuri Tijerino
yuri at tijerino.net
Tue Nov 11 21:59:55 PST 2008
Ok, let me rephrase my previous question.
Let's say that I have a simple grammar for a parser of the following form:
grammar Hour;
time
:
'at' HOURNUM ('pm' | 'am' | 'p' | 'a')
;
HOURNUM
:
'1' '0'..'2' | '0'..'9'
;
WS
:
(' ' | '\t' | '\n') {$channel = HIDDEN;}
;
I want to use this grammar to extract ALL instances of times in any
given input sentence.
For example:
1) at 4pm
2) at 8a
3) at 10p
4) at 11am
or even sentences like:
6) for 5 people at 8pm
7) find me 3 pieces at 6pm
All of these parse fine and return a time date structure with startIndex
and endIndex information. However, the following sentence causes problems:
8) at 2 times the cost at 5pm
Notice that although sentences 6 and 7 above contain "5 p" and "3 p",
which partially match the rule, the 'at' token is missing before them,
so automatic error recovery skips input until it finds a token that can
begin a valid match, in this case the 'at' literal that starts "at 8pm"
and "at 6pm".
The problem with sentence 8 is that "at 2" matches the first two tokens
of the rule, so ANTLR reports an error and exits.
The behavior I want is that when the parser finds a partial match such
as this, it throws away the partially matched tokens, in this case 'at'
and '2', and then continues parsing the rest of the input, eventually
reaching "at 5pm" and matching it correctly. I believe the default
behavior is that, since the parser does not find anything that matches
('am' | 'pm' | 'a' | 'p') but instead eventually finds another 'at' and
'5', it fails and exits. Is it possible to override this behavior and
resync the input so that the parser throws away 'at' '2' and tries to
parse the rest of the sentence, that is, "times the cost at 5pm"?
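To make the desired behavior concrete, here is a minimal sketch in plain
Python. It is not ANTLR code and does not use any ANTLR API; it only
illustrates the "discard the partial match and keep scanning" idea. The
names extract_times, is_hour, TOKEN and SUFFIXES are hypothetical, and the
tokenizer is an assumption (runs of digits or letters) rather than the
lexer generated from the grammar above.

```python
import re

# Assumed tokenizer: runs of ASCII digits or letters; everything else is skipped.
TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")
SUFFIXES = {"am", "pm", "a", "p"}


def is_hour(text):
    """True if the token is a number in the range 0..12."""
    return text.isdigit() and 0 <= int(text) <= 12


def extract_times(sentence):
    """Return (matched_text, start_index, end_index) for every time found."""
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(sentence)]
    found = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (len(window) == 3
                and window[0][0] == "at"
                and is_hour(window[1][0])
                and window[2][0] in SUFFIXES):
            # Full match of: 'at' HOURNUM suffix
            start, end = window[0][1], window[2][2]
            found.append((sentence[start:end], start, end))
            i += 3
        else:
            # No match or only a partial match (e.g. "at 2 times"):
            # drop one token and try again from the next one.
            i += 1
    return found


print(extract_times("at 2 times the cost at 5pm"))
```

Dropping one token at a time, rather than the whole partial match at
once, is a deliberate choice in this sketch: it also copes with input
such as "at at 5pm", where the second 'at' begins a valid time.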
What do I need to do to achieve this behavior at the parser level? I
believe this is similar to what the lexer is doing to tokenize all the
input until the end of the input. I want to do likewise, not at the
lexical level, but at the syntactical level. I have seen the example
for extracting comments in the documentation and wiki, but since my
domain-specific language does not have a definite termination token
(such as the "*/" for end of comment), I can't use that approach. I
have seen some clues about how to do this by catching the associated
exceptions and then doing something with the input, but I have not been
able to figure it out.
Can anyone help with this one?
Thanks in advance,
Yuri Tijerino