[antlr-interest] Help needed with baby LaTeX parser

Pavel Grinfeld pg at freeboundaries.com
Tue Jun 22 13:47:52 PDT 2010


Hi,

I'm doing pretty well recognizing LaTeX commands, but now I'm at the 
stage where I want to capture the "text". I'm having trouble defining 
"everything else".
Basically, I currently define a LaTeX document as a sequence of commands 
(as I define them), possibly separated by WS, and everything that's not 
a command is "text". The problem I keep running into is that when I 
define "text" generously, it starts grabbing tokens that belong to 
commands. Any help would be greatly appreciated!

Thanks in advance,

Pavel

I'm including the grammar I have so far, followed by the document I'm 
hoping to parse.

grammar PGTeX;

doc : (command WS?)+ EOF;

command : escWord  cWord+ ( sWord+ cWord*)?;

sWord    : '[' word ']';
cWord    : '{' word '}';
escWord : '\\' word;

word : WORD;

WORD:    ('-'|'a'..'z'|'A'..'Z'|'0'..'9'|'*')+;

WS  :   ( ' ' | '\t'| '\r' | '\n' )+;

COMMENT
     :    '%' (~('\n'|'\r'))*  {$channel = HIDDEN;};
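
To make the split I'm after concrete, here is a rough Python sketch (not 
ANTLR, and not a fix for the grammar above). It treats "text" as any run of 
characters that cannot start one of the command-related tokens, so a text 
token can never swallow a backslash, brace, bracket, or comment. The token 
names (ESCWORD, LBRACE, TEXT, OTHER, ...) and the tokenize function are 
hypothetical names chosen for this illustration, and math mode is not 
treated specially.

```python
import re

# Hypothetical token names; they are not part of the PGTeX grammar.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>%[^\r\n]*)             # '%' up to the end of the line
  | (?P<ESCWORD>\\[-a-zA-Z0-9*]+)      # backslash followed by WORD characters
  | (?P<LBRACE>\{) | (?P<RBRACE>\})
  | (?P<LBRACK>\[) | (?P<RBRACK>\])
  | (?P<WS>[ \t\r\n]+)
  | (?P<TEXT>[^\\{}\[\]% \t\r\n]+)     # anything that cannot start another token
  | (?P<OTHER>\\)                      # a lone backslash not followed by a WORD
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, text) pairs; comments are skipped, much like the HIDDEN channel."""
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "COMMENT":
            yield match.lastgroup, match.group()


if __name__ == "__main__":
    for kind, text in tokenize("\\chapter*{Intro}\n\nBook starts here $x^{2}$."):
        print(kind, repr(text))
```

In this sketch the lexer does not try to decide whether "Intro" is a command 
argument or body text; both come out as TEXT, and that decision would be 
left to the parser rules.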


And here's the document:

\documentclass{book}%
\usepackage{amsfonts}
\usepackage{amsmath}%
\newtheorem{summary}[theorem]{Summary}
\begin{document}


\chapter*{Intro}

Book starts here $x^{2}+y^{2}=1$. Here's an intersting faction:
\begin{equation}
\int_{0}^{1}\sin xdx=4
\end{equation}

\end{document}




