[antlr-interest] Getting the Previously Matched Lexer Token in the C Target

Mon Jul 19 18:00:32 PDT 2010

Hello, Kirby Bohling.

It's similar to the keyword vs. ID problem, but not exactly the same. Consider the following inputs:

-arg#hashed#
Result:
ARGUMENT (Text="arg")
ARGEXTRA (Text="hashed")

-arg#hashed# #otherData#
Result:
ARGUMENT (Text="arg")
ARGEXTRA (Text="hashed")
OTHER (Text="#otherData#")  <-- Note that the hashes need to be
included here, but excluded from the text of the ARGEXTRA token

#otherData#andsomemorethings
Result:
OTHER (Text="#otherData#andsomemorethings")  <-- If I just use a
common token for the hashed pieces, the parser has to do a lot of
stitching to rebuild this text, which is a problem.

Finally, this:
-arg #hashed#
needs to be:
ARGUMENT (Text="arg")
OTHER (Text="hashed")

If I use a common token there, the parser can't correctly discern
what to do: stitching the pieces together would actually be invalid
because of the space, and because the lexer drops whitespace, the
parser has no way to tell that the space was there.
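
To make the rules above concrete, here is a minimal hand-written sketch
in plain C. It is not ANTLR-generated code and does not use the ANTLR C
runtime API; every name in it (Lexer, Token, lexer_init, lexer_next,
token_text_equals, the TOK_* constants) is hypothetical. It only shows
the gating idea: remember the type of the previously emitted token, and
treat '#' as the start of ARGEXTRA only when it directly follows an
ARGUMENT with no whitespace in between. It assumes that OTHER keeps its
text verbatim, hashes included, as in the second example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NONE, TOK_ARGUMENT, TOK_ARGEXTRA, TOK_OTHER } TokenType;

typedef struct {
    TokenType   type;
    const char *text;  /* points into the input; NOT NUL-terminated */
    size_t      len;
} Token;

typedef struct {
    const char *pos;   /* next unread character */
    TokenType   prev;  /* type of the most recently emitted token */
} Lexer;

static int is_space(char c) { return isspace((unsigned char)c); }

void lexer_init(Lexer *lx, const char *input) {
    assert(lx != NULL && input != NULL);
    lx->pos = input;
    lx->prev = TOK_NONE;
}

/* Compare a token's text against a NUL-terminated string. */
int token_text_equals(const Token *t, const char *s) {
    size_t n = strlen(s);
    return t->len == n && memcmp(t->text, s, n) == 0;
}

/* Fill in *out, advance the lexer, and remember the emitted type. */
static int emit(Lexer *lx, Token *out, TokenType type,
                const char *text, const char *text_end, const char *next) {
    out->type = type;
    out->text = text;
    out->len = (size_t)(text_end - text);
    lx->pos = next;
    lx->prev = type;
    return 1;
}

/* Returns 1 and fills *out if a token was read, 0 at end of input. */
int lexer_next(Lexer *lx, Token *out) {
    int skipped_ws = 0;
    const char *start, *p;

    while (*lx->pos != '\0' && is_space(*lx->pos)) {
        lx->pos++;
        skipped_ws = 1;
    }
    if (*lx->pos == '\0') return 0;
    start = lx->pos;

    /* Gated rule: '#...#' is ARGEXTRA only directly after an ARGUMENT. */
    if (*start == '#' && lx->prev == TOK_ARGUMENT && !skipped_ws) {
        p = start + 1;
        while (*p != '\0' && *p != '#' && !is_space(*p)) p++;
        if (*p == '#')  /* closing hash found: text excludes both hashes */
            return emit(lx, out, TOK_ARGEXTRA, start + 1, p, p + 1);
    }

    /* ARGUMENT: '-' plus at least one character; text excludes the dash. */
    if (*start == '-') {
        p = start + 1;
        while (*p != '\0' && *p != '#' && !is_space(*p)) p++;
        if (p > start + 1)
            return emit(lx, out, TOK_ARGUMENT, start + 1, p, p);
    }

    /* OTHER: everything up to the next whitespace, kept verbatim. */
    p = start;
    while (*p != '\0' && !is_space(*p)) p++;
    return emit(lx, out, TOK_OTHER, start, p, p);
}
```

Calling lexer_init() on an input and then lexer_next() in a loop yields
the tokens one at a time. With this sketch, "-arg #hashed#" produces
ARGUMENT followed by OTHER, because whitespace was skipped before the
'#'. The OTHER text there is "#hashed#" under the verbatim assumption
stated above.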

Does that make more sense?

Billy3
--------------------------------------------------------------
Intern - PreEmptive Solutions, LLC
Malware Response Instructor - BleepingComputer.com
Analyst, Security Team - TechSupportForum.com

On Mon, Jul 19, 2010 at 8:49 PM, Kirby Bohling <kirby.bohling at gmail.com> wrote:
> If you aren't going to lex that token to something else, I'm pretty
> sure the right solution is to just lex it as the invalid token.  If
> you are going to lex it to something else, then likely you have the
> "keyword" vs. "id" problem (use one token for it in the lexer, and
> pick which one it really is during parsing), which I believe is best
> resolved by a gated predicate in the parser.  If it "looks" like legal
> code, you'd be better off actually generating the error at the semantic
> check phase.  You have a lot more context in order to generate a
> useful error message at that point.
>
> Kirby
>
>
> On Mon, Jul 19, 2010 at 7:23 PM, Billy O'Neal <billy.oneal at gmail.com> wrote:
>> Hello, Everyone :)
>>
>> Was referred here from my StackOverflow question:
>> http://stackoverflow.com/questions/3278338/using-the-antlr-c-target-how-can-i-get-the-previously-matched-token-in-the-lexer
>>
>> I'm quite new to ANTLR, and my lexer needs a gated rule that is
>> valid if and only if it occurs directly after another rule.
>> If there's a way to get the previously emitted token type, that would
>> make that gating easy. Otherwise I have to fall back to a nasty hack
>> of turning a boolean flag off after every lexer rule.
>>
>> Is it simple/easy to get that information in a lexer rule predicate?
>>
>> Billy3
>> --------------------------------------------------------------
>> Intern - PreEmptive Solutions, LLC
>> Malware Response Instructor - BleepingComputer.com
>> Analyst, Security Team - TechSupportForum.com