[antlr-interest] Help needed with baby LaTeX parser

Pavel Grinfeld pg at freeboundaries.com
Tue Jun 22 13:47:52 PDT 2010


I'm doing pretty well recognizing LaTeX commands, but now I'm at the 
stage where I want to capture the "text", and I'm having trouble 
defining "everything else".

Basically, I currently define a LaTeX document as commands (as I define 
them), possibly separated by WS; everything that's not a command should 
be "text". The problem I keep running into is that when I define "text" 
generously, it starts grabbing tokens that belong to commands. Any help 
would be greatly appreciated!

Thanks in advance,


I'm including what I have so far, and the document I'm hoping to parse.

grammar PGTeX;

doc : (command WS?)+ EOF;

command : escWord  cWord+ ( sWord+ cWord*)?;

sWord    : '[' word ']';
cWord    : '{' word '}';
escWord : '\\' word;

word : WORD;

WORD:    ('-'|'a'..'z'|'A'..'Z'|'0'..'9'|'*')+;

WS  :   ( ' ' | '\t'| '\r' | '\n' )+;

COMMENT :    '%' (~('\n'|'\r'))*  {$channel = HIDDEN;};

And here's the document:



Book starts here $x^{2}+y^{2}=1$. Here's an interesting faction:
\int_{0}^{1}\sin xdx=4
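
To make the goal concrete, here is a rough sketch in Python (not ANTLR, and not a fix for the grammar itself) of the kind of split described above: commands are picked out first, and whatever lies between them is treated as "text". The regular expression and the function name split_commands_and_text are hypothetical, for illustration only. The sketch also assumes a looser notion of a command than the `command` rule in the grammar: a backslash plus a word, followed by zero or more {word} or [word] groups, so that \int and \sin count as commands.

```python
import re

# Assumption: same character set as the WORD rule in the grammar.
WORD = r"[-a-zA-Z0-9*]+"

# Assumption: a command is a backslash plus a word, followed by any number of
# {word} or [word] groups. The grammar's `command` rule is stricter (it
# requires at least one {word} group).
COMMAND = re.compile(r"\\" + WORD + r"(?:\{" + WORD + r"\}|\[" + WORD + r"\])*")


def split_commands_and_text(source):
    """Return a list of (kind, value) pairs; kind is 'command' or 'text'."""
    pieces = []
    pos = 0
    for match in COMMAND.finditer(source):
        # Anything between the previous command and this one is text.
        if match.start() > pos:
            pieces.append(("text", source[pos:match.start()]))
        pieces.append(("command", match.group()))
        pos = match.end()
    # Trailing text after the last command, if any.
    if pos < len(source):
        pieces.append(("text", source[pos:]))
    return pieces


sample = "Here's a formula: \\int_{0}^{1}\\sin xdx=4"
for kind, value in split_commands_and_text(sample):
    print(kind, repr(value))
```

The point of the sketch is the ordering: commands are matched first and text is defined only as the leftover between matches, so a generous notion of text cannot swallow characters that belong to a command.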

