[antlr-interest] Reposting an email about a baby latex parser

Pavel Grinfeld pg at freeboundaries.com
Thu Jun 24 11:27:28 PDT 2010


Hi, I hope it's OK to resend an email that was overlooked previously...

My problem is separating text from commands in LaTeX. I'm doing pretty 
well recognizing LaTeX commands, but now I'm at the stage where I want 
to capture the "text". I'm having trouble defining "everything else".

Basically, I currently define a LaTeX document as a sequence of commands 
(as I define them), possibly separated by WS, and everything that's not 
a command is "text". The problem I keep running into is that when I 
define "text" generously, it starts grabbing tokens that belong to 
commands. Any help would be greatly appreciated!

Thanks in advance,

Pavel

I'm including what I have so far, along with the document I'm hoping to 
parse.

grammar PGTeX;

doc : (command WS?)+ EOF;

command : escWord  cWord+ ( sWord+ cWord*)?;

sWord    : '[' word ']';
cWord    : '{' word '}';
escWord : '\\' word;

word : WORD;

WORD:    ('-'|'a'..'z'|'A'..'Z'|'0'..'9'|'*')+;

WS  :   ( ' ' | '\t'| '\r' | '\n' )+;

COMMENT
     :    '%' (~('\n'|'\r'))*  {$channel = HIDDEN;};


And here's the document:

\documentclass{book}%
\usepackage{amsfonts}
\usepackage{amsmath}%
\newtheorem{summary}[theorem]{Summary}
\begin{document}


\chapter*{Intro}

Book starts here $x^{2}+y^{2}=1$. Here's an intersting faction:
\begin{equation}
\int_{0}^{1}\sin xdx=4
\end{equation}

\end{document}
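
For illustration only, the split between commands and "everything else" 
described above can be sketched outside ANTLR as a small tokenizer in 
Python. This is a rough sketch under stated assumptions, not a grammar 
fix: the names tokenize, TOKEN_RE, WORD_CHARS, ARG and STRAY are 
hypothetical, a command is assumed to take zero or more {...} or [...] 
groups (looser than the command rule, which requires at least one 
cWord) so that \int and \sin are still recognized, and groups are 
assumed not to nest. The idea is ordered alternation: the command and 
comment patterns are tried first, and the catch-all text pattern stops 
before any '\' or '%', so it cannot swallow the start of a command.

```python
import re

# Same character set as the WORD rule in the grammar.
WORD_CHARS = r"[-A-Za-z0-9*]+"
# One {...} or [...] group; nesting is NOT handled (assumption).
ARG = r"(?:\{[^{}]*\}|\[[^\[\]]*\])"

# Alternatives are tried left to right at each position, so the more
# specific patterns win and TEXT only sees what is left over.
TOKEN_RE = re.compile(
    "|".join([
        r"(?P<COMMAND>\\" + WORD_CHARS + ARG + "*)",  # \name{..}[..]{..}
        r"(?P<COMMENT>%[^\r\n]*)",                    # % to end of line
        r"(?P<WS>\s+)",                               # whitespace between tokens
        r"(?P<TEXT>[^\\%]+)",                         # stops before '\' or '%'
        r"(?P<STRAY>\\)",                             # backslash that starts no command
    ])
)


def tokenize(source):
    """Return (kind, text) pairs, dropping whitespace and comments."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "STRAY":
            kind = "TEXT"  # treat a lone backslash as ordinary text
        tokens.append((kind, match.group().rstrip()))
    return tokens


sample = r"""\chapter*{Intro}

Book starts here $x^{2}+y^{2}=1$.
\begin{equation}
\int_{0}^{1}\sin xdx=4
\end{equation}
"""

for kind, text in tokenize(sample):
    print(kind, repr(text))
```

In this sketch the math fragments such as _{0}^{1} and xdx=4 come out 
as text, because only a backslash followed by a word starts a command. 
Whether the same ordering trick carries over to an ANTLR lexer rule is 
not shown here.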





