[antlr-interest] Newbie! how can I convert a list of bullets
to an HTML list
Matthew Pearce
mpearce at digitas.com
Tue Jun 7 02:25:04 PDT 2005
Thanks very much, Matthew (Ford).
This Antlr is dangerously addictive, isn't it? I also spent too much
time at the weekend trying out different ideas. I got something working
by using syntactic predicates in the parser, as you suggested before, and
by reorganizing the lexer token rules.
The lexer rules are somewhat simpler, now that I realise I can use a
newline as a delimiter while still consuming superfluous newlines.
My parser rules are still a little complex, with the syntactic
predicates. (I also tried semantic predicates, but it seems they only
raise exceptions and don't help determine the next match.)
Thank you for your example. It was an epiphany for me to see how you can
repeat a subrule (para)*, while still optionally matching a list (para
(list)+), without getting non-determinism. That was what I was aiming
for at first, with para (list)?, but I hadn't thought of using the |.
I will go over my grammar again, and see if applying this pattern
simplifies my parser.
I wonder whether you have any suggestion about how to nest the list
items, input in the following manner:
1. Lorem ipsum
1.1. ipsum dolor
1.2. sit consectetuer
2. Ipsum sit dolor
3. Dolr sit ipsum
At the moment, I only have 1 level of nesting allowed, because I have to
explicitly define in the lexer:
NESTED_NUMBERED_LIST
:
(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!))
(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!)) LINE
;
NUMBERED_LIST
:
(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!)) LINE
;
LINE
:
(~('\r' | '\n'))+ ('\r' '\n')? { newline(); } // DOS FILE
;
The associated parser rules are rather complicated, although they do
work.
But I would like to have something that handles arbitrary levels of
nesting. Here is one idea I had, of incrementing a counter in the token,
although I am not entirely sure how to handle this in the parser.
NUMBERED_LIST
{ ListToken token = new ListToken(); /* extends CommonToken */ }
:
NUMBERED_LIST_START LINE
| NUMBERED_LIST_START
(NUMBERED_LIST_START { token.increment(); })+ LINE { $setToken(token); }
;
NUMBERED_LIST_START
:
(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!))
;
Can you perhaps suggest an alternative approach?
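One alternative, sketched here in plain Java rather than inside the grammar: hand over the numeric prefix (for example 1.2.) together with the rest of the line, and derive the nesting depth by counting the components of the prefix. This is only a sketch under assumptions: the names NestedListSketch, depth and toHtml are made up for illustration, every input line is taken to be a list item, and the depth is assumed to grow by at most one level at a time.

```java
import java.util.Arrays;
import java.util.List;

class NestedListSketch {
    // Depth = number of non-empty dot-separated components: "1." -> 1, "1.2." -> 2.
    static int depth(String prefix) {
        int d = 0;
        for (String part : prefix.split("\\.")) {
            if (!part.isEmpty()) {
                d++;
            }
        }
        return d;
    }

    // Converts lines such as "1.1. ipsum dolor" into nested <ol> markup.
    static String toHtml(List<String> lines) {
        StringBuilder out = new StringBuilder();
        int open = 0; // number of <ol> elements currently open
        for (String line : lines) {
            // Split into the numeric prefix and the item text at the first space or tab.
            String[] parts = line.split("[ \\t]+", 2);
            int d = depth(parts[0]);
            String text = parts.length > 1 ? parts[1] : "";
            if (d > open) {
                // Deeper item: open a new list inside the still-open parent <li>.
                while (open < d) {
                    out.append("<ol>");
                    open++;
                }
            } else {
                // Same or shallower level: close the previous item first,
                // then close any inner lists together with their parent items.
                out.append("</li>");
                while (open > d) {
                    out.append("</ol></li>");
                    open--;
                }
            }
            out.append("<li>").append(text);
        }
        // Close whatever is still open at the end of the input.
        while (open > 0) {
            out.append("</li></ol>");
            open--;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml(Arrays.asList(
            "1. Lorem ipsum", "1.1. ipsum dolor", "1.2. sit consectetuer",
            "2. Ipsum sit dolor", "3. Dolr sit ipsum")));
    }
}
```

The same depth count could be stored on a custom token, with the open and close logic moved into parser actions, so the lexer would need only one list rule instead of one rule per nesting level.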
Regards,
Matthew
-----Original Message-----
From: Matthew Ford <matthew.ford at forward.com.au>
To: Matthew Pearce <mpearce at digitas.com>; antlr-interest at antlr.org
<antlr-interest at antlr.org>
Sent: Sat Jun 04 01:31:18 2005
Subject: Re: [antlr-interest] Newbie! how can I convert a list of
bullets to an HTML list
Hi Matthew (Pearce)
here is a first pass using Antlr V3.0
Note: Antlr V3.0 is different (and better) than V2.0 so perhaps this is
not actually of much use to you.
Also note that whitespace has been lost in the list items.
(Ter, is there a simple way around this common problem?)
matthew
Input
- Lorem ipsum
- Dolor sit
- Amet
- Foo bar
- Bar foo
- Foo
Output
<ol>
<li>Loremipsum</li>
<li>Dolorsit</li>
<li>Amet</li>
</ol>
<ol>
<li>Foobar</li>
<li>Barfoo</li>
<li>Foo</li>
</ol>
GRAMMAR ==============
grammar Lists;
start
: (paraOrList)*
;
paraOrList
: para
| para {System.out.println("<ol>");} (list)+
{System.out.println("</ol>");}
;
list
: {System.out.print("<li>");} MINUS (w=WORD
{System.out.print(w.getText());} )*
{System.out.println("</li>");} NL
;
para
: NL NL
;
WORD : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+ // + rather than *, so WORD cannot match the empty string
;
MINUS :
'-'
;
NL : '\n'
;
WS : ( ' '
| '\t'
| '\r'
)+
{ channel=99; }
;
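On the lost whitespace: WS goes to channel 99 and the list rule prints only the text of each WORD, so nothing is written between the words. One possible way around it, assuming it is acceptable to normalise every run of whitespace to a single space, is to collect the WORD texts of an item and join them before printing. A minimal sketch in plain Java; ListItemSketch and itemHtml are hypothetical names, not part of ANTLR.

```java
import java.util.Arrays;
import java.util.List;

class ListItemSketch {
    // Joins the word texts of one list item with single spaces and wraps them in <li>.
    static String itemHtml(List<String> words) {
        return "<li>" + String.join(" ", words) + "</li>";
    }

    public static void main(String[] args) {
        // Prints <li>Lorem ipsum</li>
        System.out.println(itemHtml(Arrays.asList("Lorem", "ipsum")));
    }
}
```

In the grammar this would mean appending each w.getText() to a list inside the (w=WORD ...)* loop and printing the joined result once the NL is reached.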
----- Original Message -----
From: Matthew Pearce <mailto:mpearce at digitas.com>
To: Matthew Ford <mailto:matthew.ford at forward.com.au> ;
antlr-interest at antlr.org
Sent: Friday, June 03, 2005 11:54 PM
Subject: RE: [antlr-interest] Newbie! how can I convert a list
of bullets to an HTML list
Matthew,
Thanks for your reply. I'll try adding a predicate, as you
suggest. I actually don't have any problem finding a list in the lexer.
But, I guess, in the parser, I somehow have to know that one list token
is the first or last of a sequence, which, from the docs, sounded like a
context-sensitive grammar, like:
para list -> list_begin list_item
list list -> list_item
list para -> list_item list_end
Does that make sense to you?
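To make the three rewrites above concrete, here is a small sketch in plain Java, outside ANTLR. It assumes the input has already been classified line by line as either a paragraph or a list line; the names ListBoundarySketch, Kind and events are hypothetical. It suggests that remembering only the previous line kind is enough to place list_begin and list_end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListBoundarySketch {
    // Hypothetical line kinds standing in for the lexer's token types.
    enum Kind { PARA, LIST }

    // Emits list_begin / list_item / list_end around each run of LIST lines.
    static List<String> events(List<Kind> kinds) {
        List<String> out = new ArrayList<>();
        Kind prev = Kind.PARA; // the start of input counts as a paragraph boundary
        for (Kind k : kinds) {
            if (k == Kind.LIST) {
                if (prev != Kind.LIST) {
                    out.add("list_begin"); // para list -> list_begin list_item
                }
                out.add("list_item");
            } else {
                if (prev == Kind.LIST) {
                    out.add("list_end"); // list para -> list_end, then the para
                }
                out.add("para");
            }
            prev = k;
        }
        if (prev == Kind.LIST) {
            out.add("list_end"); // input ended inside a list
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(events(Arrays.asList(
            Kind.PARA, Kind.LIST, Kind.LIST, Kind.PARA)));
    }
}
```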
A list is actually the character sequence:
\n
-\tLorem ipsum\n
-\tDolor sit\n
-\tAmet\n
\n
-\sFoo bar\n
-\sBar foo\n
-\sFoo\n
I haven't attempted it yet, but I also need to support a char
sequence like
\n
1.\tLorem ipsum\n
2.\tDolor sit\n
2.1.\tAmet\n
2.2.\tConsectetuer Amet\n
making a nested HTML ordered list
<ol><li><ol><li>Consectetuer Amet</li></ol></li></ol>.
Hence my earlier point about nested lists.
________________________________
From: Matthew Ford [mailto:matthew.ford at forward.com.au]
Sent: 02 June 2005 23:02
To: Matthew Pearce; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Newbie! how can I convert a list
of bullets to an HTML list
Is the list actually the character sequence
\n
\t-\tbullet\n
\t-\tbullet\n
\t-\tbullet\n
\t-\tbullet\n
What makes a list different from other text like \t-\t?
matthew
You may need to do infinite lookahead to decide whether you are
processing a list
like
(list) => list
see Syntactic Predicates in the docs
matthew
----- Original Message -----
From: Matthew Pearce <mailto:mpearce at digitas.com>
To: antlr-interest at antlr.org
Sent: Friday, June 03, 2005 1:19 AM
Subject: [antlr-interest] Newbie! how can I convert a
list of bullets to an HTML list
I'd like to convert a list of bullets to an HTML list,
i.e.:
From:
- bullet
- bullet
- bullet
To:
<ul><li>bullet</li><li>bullet</li><li>bullet</li></ul>
I thought over a few different options:
1. Have the lexer produce a LIST token when it matches:
- bullet
But I don't know how to get the parser to find the <ul>
tags, because I cannot add a special case
2. Have the lexer produce a LIST token when it matches:
- bullet
- bullet
- bullet
But I don't know how to get the parser to insert the
<li> tags, because it hasn't tokenized each bullet
3. Have the parser match a rule for list that matches
like:
list: LIST^ PARA (LIST! PARA)+
Which would give me an AST like the following, which could support
nested lists:
LIST ----+----PARA
         +----PARA
         +----LIST----+----PARA
                      +----PARA
But this gives me non-determinism between matching a
plain paragraph (PARA) and a bulleted line (LIST PARA).
Can anyone suggest an approach?
class CourseTreeWalker extends TreeParser;
tree2html returns [String s]
{ s = ""; }
:
(#(t:TTL (p:PARA | l:list)+ {
s+="<h4>" +t+ "</h4>\n";
s+= "<p>" +p+ "</p>\n";
s+= "<ul>"+l+"</ul>"; } ))+ // this doesn't do what I want
;
list // this doesn't do what I want
{ String l = ""; }
:
(#(LIST (p2:PARA) {
l+="<ul><li>" +p2+ "</li></ul>\n";
} ))
;
class CourseParser extends Parser;
options {
buildAST = true;
}
file : (section)+ EOF! ;
section : TTL^ (listexpr)+;
// this just matches each bullet, instead of treating bullets as a group
listexpr : (LIST^)? paraexpr;
paraexpr: (PARA);
class CourseLexer extends Lexer;
options {
k = 3;
charVocabulary = '\3'..'\377';
}
PARA : ("LZU") =>
("LZU" (LETTER | DIGIT | ' ' | '/')+) {
$setType(TTL); }
|
("Des") =>
("Description:") { $setType(TTL); }
|
("Lea") =>
("Learning objectives:") { $setType(TTL); }
|
("Tar") =>
("Target audience:") { $setType(TTL); }
|
("Pre") =>
("Prerequisites:") { $setType(TTL); }
|
(CHAR | ' ' )+
;
LIST : ('-' | '*') ;
NEWLINE : (
('\r''\n')=> '\r''\n' //DOS
| '\r' //MAC
| '\n' //UNIX
)
{ $setType(Token.SKIP); newline(); }
;
protected
DIGIT
: '0'..'9'
;
protected
LETTER
: ('a'..'z' | 'A'..'Z')
;
protected
CHAR
: ~( '\n' | '\r' | ' ' | '\t' | '\f' | '-' | '*' )
;