[antlr-interest] complex lexer (at least to me)
Stanislas Rusinsky
rusinskystanislas at yahoo.fr
Fri Oct 15 04:19:19 PDT 2010
Hi list,
While writing a parser I ran into trouble correctly lexing comments and
non-comments that look like comments.
Comments start with a '#' and end at the newline; they should be hidden.
BUT '#!something' is an ID,
and ':#header' has its own meaning too.
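To make the intended behaviour concrete, here is a rough hand-written Java sketch (not ANTLR, and not my actual lexer) of the decision I want made at each '#'. The class, enum and method names are made up for illustration, and it assumes HEADER is simply the literal word "header", which may not match the real grammar:

```java
class HashClassifier {
    /** Hypothetical token kinds, named after the lexer rules in my grammar. */
    enum Kind { DBT_UNIT_NAME_START, COLUMN_NAMES_END, COMMENT, NOT_HASH }

    /**
     * Decide what the text starting at position pos should be lexed as,
     * by looking at the characters that follow the '#'.
     */
    static Kind classify(String input, int pos) {
        if (pos >= input.length() || input.charAt(pos) != '#') {
            return Kind.NOT_HASH;
        }
        // '#!' starts a unit name; it is not a comment.
        if (input.startsWith("!", pos + 1)) {
            return Kind.DBT_UNIT_NAME_START;
        }
        // Assumption: HEADER is the literal word "header".
        if (input.startsWith("header", pos + 1)) {
            return Kind.COLUMN_NAMES_END;
        }
        // Anything else after '#' is a comment running to the end of the line.
        return Kind.COMMENT;
    }

    public static void main(String[] args) {
        System.out.println(classify("#!something", 0));
        System.out.println(classify(":#header", 1));
        System.out.println(classify("# plain comment\n", 0));
    }
}
```

The point is that the two special cases have to be checked before falling back to the comment case, using only the one or few characters after the '#'.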
I've tried several approaches (synpreds, ...), none of which worked well enough.
In this attempt, the last alternative swallows everything:
COLUMN_NAMES_END
    : HASH HEADER
      { System.out.println(" ^~^ LEXER: found HEADER_COMMENT: " + $text); }
    ;

DBT_UNIT_NAME_START
    : HASH BANG
      { System.out.println(" ^~^ LEXER: found DBT_UNIT_NAME_START: " + $text); }
    ;

LINE_COMMENT_OR_ELSE
    : ( HASH BANG ) => DBT_UNIT_NAME_START
      { $type = DBT_UNIT_NAME_START;
        System.out.println(" ^~^ LEXER: found HASH BANG: DBT_UNIT_NAME_START: " + $text); }
    | ( HASH HEADER ) => COLUMN_NAMES_END
      { $type = COLUMN_NAMES_END;
        System.out.println(" ^~^ LEXER: found HASH HEADER: COLUMN_NAMES_END: " + $text); }
    | ( HASH (options {greedy=false;} : .)* NEWLINE ) => COMMENT
      { System.out.println(" ^~^ LEXER: LINE_COMMENT Ignoring LINE comment: " + $text); }
    ;

protected
COMMENT
    : HASH (options {greedy=false;} : .)* NEWLINE
      { $channel=HIDDEN;
        System.out.println(" ^~^ LEXER: COMMENT: Ignoring LINE comment: " + $text); }
    ;
So every '#' line ends up being caught by COMMENT, and I get this single error
message when generating the grammar:
[java] error(208): JADATextGrammar.g:98:1: The following token definitions can never be matched because prior tokens match the same input: COMMENT
Any ideas??
Stanislas Herman Rusinsky.
P.S.: From the article "What makes a language problem hard?"
( http://www.antlr.org/wiki/pages/viewpage.action?pageId=1773 ), it looks like I
am hitting these:
* Context sensitive lexer? You can't decide what vocabulary symbol to match
unless you know what kind of sentence you are parsing.
* Is the set of all input fixed? If you have a fixed set of files to convert,
your job is much easier because the set of language construct combinations is
fixed. For example, building a general Pascal to Java translator is much harder
than building a translator for a set of 50 existing Pascal files.
* Are delimiters non-fixed for things like strings and comments? That makes it
tough to build an efficient lexer.
* Are the source statements really similar; declarations vs expressions in C++?
* Column sensitive input? E.g., are newlines significant like lines in a log
file and does the position of an item change its meaning?
* Does your input have comments as you do in programming languages that can
occur anywhere in the input and need to go into the output in a sane location?
* How much semantic information do you need to do the translation? For example,
do you need to simply know that something is a type name or do you need to know
that it is, say, an array whose indices are a set like (day,week,month) and
contains records? Sometimes syntax alone is enough to do translation.
* ...