[antlr-interest] recombining tokens

Johannes Luber jaluber at gmx.de
Mon Jul 28 03:13:42 PDT 2008


Davyd Madeley wrote:
> Hi all,
> 
> I'm currently writing a grammar in which '/' is used to append a
> qualifier to a token. Unfortunately it is also used in path parameters.
> 
> I am trying to figure out how I can recombine tokens in the case where I
> determine I'm reading a path.
> 
> e.g.
> 
> // these are my token delimiters
> TOKEN
> 	: ~(','|'>'|'*'|'/'|'('|')'|LINE_TERMINATOR)+
> 	;
> 
> At one point in the state machine, I expect to be able to start reading
> parameters ('LINE' is a special token at the start of the file, but
> after that is just a regular token):
> 
> parameter
> 	: a=TOKEN	-> PARAMETER[$a]
> 	| a='LINE'	-> PARAMETER[$a]
> 	| path		-> ^(PATH path)
> 	;
> 
> path
> 	: ('/' TOKEN)+
> 	;
> 
> Every so often, a path will be provided. Currently this will be
> tokenised around the '/', which is undesirable.
> 
> e.g.
>       PATH (9) .......................... PATH
>         '/' (20) ........................ /
>         TOKEN (11) ...................... path
>         '/' (20) ........................ /
>         TOKEN (11) ...................... to
>         '/' (20) ........................ /
>         TOKEN (11) ...................... my.file
> 
> What I want to do is be able to recombine this into a
> PARAMETER["/path/to/my.file"].
> 
> Someone spoke about a concatenation operator, but I can't find any info
> about it.
> 
> Regards,
> --davyd
> 

The root cause of the problem is that the lexer runs independently of
the parser, so without extra code in the lexer you cannot decide whether
a '/' belongs to a qualifier or to a path. Adding that code amounts to
writing a mini-parser inside the lexer, and it may need more context
information than a pure lexer can provide. It may be easier to recognize
a path in a first pass as a series of tokens and then rewrite that
series into a single token. That approach means you need an AST grammar
for the second pass.

Another question is whether you truly need a single PATH token or
whether you can use the $path.text attribute instead. Depending on your
needs, this may still perform better than the two approaches above.
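
A minimal sketch of the $path.text idea, assuming ANTLR 3 with
output=AST and the rules from your mail. It is untested; the label p is
introduced only for illustration, and the two-argument form of the
imaginary-token constructor is used so that the new node keeps the
line and column of the first '/'.

```antlr
parameter
	: a=TOKEN	-> PARAMETER[$a]
	| a='LINE'	-> PARAMETER[$a]
	// $p.start is the first token matched by path (gives position info);
	// $p.text is the text of everything path matched, e.g. /path/to/my.file
	| p=path	-> PARAMETER[$p.start, $p.text]
	;

path
	: ('/' TOKEN)+
	;
```

The lexer still emits separate '/' and TOKEN tokens here; only the
resulting tree node carries the recombined text. Note that $p.text is
taken from the token stream between the first and last token of the
rule, so if your lexer sends whitespace to a hidden channel rather than
skipping it, that whitespace would probably show up in the text as well.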

Johannes

