[antlr-interest] Newbie! how can I convert a list of bullets to an HTML list

Tue Jun 7 02:25:04 PDT 2005

Thanks very much, Matthew (Ford). 

This Antlr is dangerously addictive, isn't it? I also spent too much
time on it at the weekend, trying out different ideas. I got something
working using syntactic predicates in the parser, as you suggested
before, and by reorganizing the lexer token rules. 

The lexer rules are somewhat simpler, now that I realise I can use a
newline as a delimiter while still consuming superfluous newlines. 

My parser rules are still a little complex, with the syntactic
predicates. (I also tried semantic predicates, but it seems they only
raise exceptions and don't help determine the next match.) 

Thank you for your example. It was an epiphany for me to see how you can
repeat a subrule (para)*, while still optionally matching a list (para
(list)+), without getting non-determinism. That was what I was aiming
for at first, with para (list)?, but I hadn't thought about using the |.

I will go over my grammar again, and see if applying this pattern
simplifies my parser.

I wonder whether you have any suggestion about how to nest the list
items, input in the following manner:

1. 	Lorem ipsum
1.1. 	ipsum dolor
1.2.	sit consectetuer
2.	Ipsum sit dolor
3.	Dolr sit ipsum

At the moment, I only have 1 level of nesting allowed, because I have to
explicitly define in the lexer:

NESTED_NUMBERED_LIST
	:
	(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!)) (('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!)) LINE
	;
NUMBERED_LIST
	:
	(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!)) LINE
	;
LINE
	:
	(~('\r' | '\n'))+ ('\r' '\n')? { newline(); }  // DOS FILE
 	;

The associated parser rules are rather complicated, although they do
work.

But I would like to have something that handles arbitrary levels of
nesting. Here is one idea I had, of incrementing a counter in the token,
but I'm not entirely sure how to handle this in the parser. 

NUMBERED_LIST
{ ListToken t = new ListToken(); /* ListToken extends CommonToken */ }
	:
	NUMBERED_LIST_START LINE
	| NUMBERED_LIST_START (NUMBERED_LIST_START { t.increment(); })+ LINE { $setToken(t); }
	;

NUMBERED_LIST_START
	:
	(('1'..'9'!) ('0'..'9'!)? ('.'!|'\t'!|' '!))
	;
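
On the parser side, one way the counter might be used is sketched below
in plain Java rather than as grammar actions. It is only a sketch under
assumptions: NestedListSketch and render are made-up names, the depth is
recomputed from the "1.2." prefix (standing in for the counter a
ListToken would carry), and every level is assumed to end with a '.'.
The number of currently open <ol> elements is compared with each item's
depth to decide whether to open or close lists.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of ANTLR): renders numbered lines as nested <ol> lists.
final class NestedListSketch {

    static String render(List<String> lines) {
        StringBuilder out = new StringBuilder();
        int open = 0; // number of <ol> elements currently open
        for (String line : lines) {
            // Count leading "digits." groups: "1." -> depth 1, "1.2." -> depth 2.
            int depth = 0;
            int i = 0;
            while (true) {
                int start = i;
                while (i < line.length() && Character.isDigit(line.charAt(i))) {
                    i++;
                }
                if (i == start || i >= line.length() || line.charAt(i) != '.') {
                    i = start; // not a complete "digits." group; leave it in the text
                    break;
                }
                i++; // consume the '.'
                depth++;
            }
            if (depth == 0) {
                continue; // not a list line; ignored in this sketch
            }
            String text = line.substring(i).trim();
            if (depth > open) {
                // Deeper than before: open lists until we reach the item's depth.
                while (open < depth) {
                    out.append("<ol>");
                    open++;
                    if (open < depth) {
                        out.append("<li>"); // wrapper item for a skipped level
                    }
                }
            } else {
                // Same level or shallower: close deeper lists, then the previous sibling.
                while (open > depth) {
                    out.append("</li></ol>");
                    open--;
                }
                out.append("</li>");
            }
            out.append("<li>").append(text);
        }
        // Close whatever is still open at the end of the input.
        while (open > 0) {
            out.append("</li></ol>");
            open--;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList(
            "1. \tLorem ipsum", "1.1. \tipsum dolor", "1.2.\tsit consectetuer",
            "2.\tIpsum sit dolor", "3.\tDolr sit ipsum")));
    }
}
```

For the five sample lines above, this nests the two 1.x items in an
inner <ol> inside the first <li>, and leaves items 2 and 3 at the top
level.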

Can you perhaps suggest an alternative approach?

Regards,
Matthew

-----Original Message-----
From: Matthew Ford <matthew.ford at forward.com.au>
To: Matthew Pearce <mpearce at digitas.com>; antlr-interest at antlr.org
<antlr-interest at antlr.org>
Sent: Sat Jun 04 01:31:18 2005
Subject: Re: [antlr-interest] Newbie! how can I convert a list of
bullets to an HTML list

Hi Matthew  (Pearce)
here is a first pass using Antlr V3.0
Note: Antlr V3.0 is different (and better) than V2.0 so perhaps this is
not actually of much use to you.

Also note that whitespace has been lost in the list items.
(Ter, is there a simple way around this common problem?)
matthew

Input

- Lorem ipsum
- Dolor sit
- Amet

- Foo bar
- Bar foo
- Foo

Output

<ol>
<li>Loremipsum</li>
<li>Dolorsit</li>
<li>Amet</li>
</ol>
<ol>
<li>Foobar</li>
<li>Barfoo</li>
<li>Foo</li>
</ol>

GRAMMAR ==============
grammar Lists;

start
  : (paraOrList)*
  ;

paraOrList
  : para
  | para {System.out.println("<ol>");} (list)+ {System.out.println("</ol>");}
  ;

list
  : {System.out.print("<li>");} MINUS (w=WORD {System.out.print(w.getText());} )*
    {System.out.println("</li>");} NL
  ;

para
  : NL NL
  ;

WORD  :   ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+
    ;

MINUS :
    '-'
    ;

NL  : '\n'
    ;

WS  :   (   ' '
        |   '\t'
        |   '\r'
        )+
        { channel=99; }
    ;    

----- Original Message ----- 

	From: Matthew Pearce <mailto:mpearce at digitas.com>  
	To: Matthew Ford <mailto:matthew.ford at forward.com.au>  ;
antlr-interest at antlr.org 
	Sent: Friday, June 03, 2005 11:54 PM
	Subject: RE: [antlr-interest] Newbie! how can I convert a list
of bullets to an HTML list

	Matthew,

	Thanks for your reply. I'll try adding a predicate, as you
suggest. I actually don't have any problem finding a list in the lexer.
But, I guess, in the parser, I somehow have to know that one list token
is the first or last of a sequence, which, from the docs, sounded like a
context-sensitive grammar, like:

	para list -> list_begin list_item

	list list -> list_item

	list para -> list_item list_end

	Does that make sense to you?
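
	The three rewrite rules above may not actually need a
context-sensitive grammar: a single pass over the token stream with one
flag of state seems to be enough. A minimal sketch in plain Java,
assuming a made-up enum Kind and a made-up helper markBoundaries
(neither is an ANTLR API); PARA tokens are passed through unchanged:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: wraps each run of LIST tokens in LIST_BEGIN ... LIST_END.
final class ListBoundarySketch {

    // PARA and LIST mirror the tokens in this thread; the other kinds are invented here.
    enum Kind { PARA, LIST, LIST_BEGIN, LIST_ITEM, LIST_END }

    static List<Kind> markBoundaries(List<Kind> in) {
        List<Kind> out = new ArrayList<>();
        boolean inList = false; // true while we are inside a run of LIST tokens
        for (Kind k : in) {
            if (k == Kind.LIST) {
                if (!inList) {
                    out.add(Kind.LIST_BEGIN); // "para list -> list_begin list_item"
                    inList = true;
                }
                out.add(Kind.LIST_ITEM);      // "list list -> list_item"
            } else {
                if (inList) {
                    out.add(Kind.LIST_END);   // "list para -> list_item list_end"
                    inList = false;
                }
                out.add(k);
            }
        }
        if (inList) {
            out.add(Kind.LIST_END); // input ended while still inside a list
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(markBoundaries(Arrays.asList(
            Kind.PARA, Kind.LIST, Kind.LIST, Kind.PARA, Kind.LIST)));
    }
}
```

	A parser rule or tree walker could then emit the opening list
tag on LIST_BEGIN and the closing tag on LIST_END.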

	A list is actually the character sequence:

	\n

	-\tLorem ipsum\n

	-\tDolor sit\n

	-\tAmet\n

	\n

	-\sFoo bar\n

	-\sBar foo\n

	-\sFoo\n

	I haven't attempted it yet, but I also need to support a char
sequence like

	\n

	1.\tLorem ipsum\n

	2.\tDolor sit\n

	2.1.\tAmet\n

	2.2.\tConsectetuer Amet\n

	making a nested HTML ordered list <ol><li><ol><li>Consectetuer
Amet</li></ol></li></ol>.

	Hence my earlier point about nested lists.

	________________________________

		From: Matthew Ford [mailto:matthew.ford at forward.com.au] 
	Sent: 02 June 2005 23:02
	To: Matthew Pearce; antlr-interest at antlr.org
	Subject: Re: [antlr-interest] Newbie! how can I convert a list
of bullets to an HTML list

	Is the list actually the character sequence:

	\n

	\t-\tbullet\n

	\t-\tbullet\n

	\t-\tbullet\n

	\t-\tbullet\n

	What makes a list different from other text like \t-\t?

	matthew

	You may need to do infinite lookahead to decide that you are
processing a list

	 like 

	(list) => list

	see Syntactic Predicates in the docs

	matthew

		----- Original Message ----- 

		From: Matthew Pearce <mailto:mpearce at digitas.com>  

		To: antlr-interest at antlr.org 

		Sent: Friday, June 03, 2005 1:19 AM

		Subject: [antlr-interest] Newbie! how can I convert a
list of bullets to an HTML list

		I'd like to convert a list of bullets to an HTML list,
i.e.:

		From:

		-          bullet

		-          bullet

		-          bullet

		To:

		<ul><li>bullet</li><li>bullet</li><li>bullet</li></ul>

		I thought over a few different options:

		1. Have the lexer produce a LIST token when it matches:

		 - bullet

		But I don't know how to get the parser to find the <ul>
tags, because I cannot add a special case

		2. Have the lexer produce a LIST token when it matches:

		-          bullet

		-          bullet

		-          bullet

		But I don't know how to get the parser to insert the
<li> tags, because it hasn't tokenized each bullet

		3. Have the parser match a rule for list that matches
like:

		list:       LIST^  PARA (LIST! PARA)+

		Which would give me an AST like the following, which
could support nested lists:

		LIST ----+---- PARA
		         +---- PARA
		         +---- LIST ----+---- PARA
		                        +---- PARA

		But this gives me non-determinism between matching a
plain paragraph (PARA) and a bulleted line (LIST PARA).

		Can anyone suggest an approach?  

		class CourseTreeWalker extends TreeParser;

		tree2html returns [String s]

		{ s = ""; }

		    :

		      (#(t:TTL (p:PARA | l:list)+ { 

		            s+="<h4>" +t+ "</h4>\n";

		            s+= "<p>" +p+ "</p>\n";

		            s+= "<ul>"+l+"</ul>"; } ))+   // this doesn't do what I want

		    ;

		list        // this doesn't do what I want

		{ String l = ""; }

		 :

		      (#(LIST (p2:PARA) { 

		            l+="<ul><li>" +p2+ "</li></ul>\n";

		             } ))

		;

		class CourseParser extends Parser;

		options {

		    buildAST = true;

		}

		file :  (section)+ EOF! ;

		section : TTL^ (listexpr)+;

		listexpr : (LIST^)? paraexpr;   // this just matches each bullet, instead of treating bullets as a group

		paraexpr: (PARA);

		class CourseLexer extends Lexer;

		options {

		    k = 3; 

		    charVocabulary = '\3'..'\377';

		}

		PARA  : ("LZU") =>

		        ("LZU" (LETTER | DIGIT | ' ' | '/')+)  { $setType(TTL); }

		        |

		        ("Des") =>

		        ("Description:")   { $setType(TTL); }

		        |

		        ("Lea") =>

		        ("Learning objectives:")   { $setType(TTL); }

		        |

		        ("Tar") =>

		        ("Target audience:")   { $setType(TTL); }

		        |

		        ("Pre") =>

		        ("Prerequisites:")   { $setType(TTL); }

		        |

		         (CHAR | ' ' )+ 

		      ;

		LIST   : ('-' | '*') ;

		NEWLINE : (

		                  ('\r''\n')=> '\r''\n' //DOS

		                  | '\r' //MAC 

		                  | '\n' //UNIX

		                  )

		                  { $setType(Token.SKIP); newline();  }

		            ;

		protected

		DIGIT

		      : '0'..'9'

		      ;

		protected

		LETTER

		      : ('a'..'z' | 'A'..'Z')

		      ;

		protected

		CHAR

		      : ~( '\n' | '\r' | ' ' | '\t' | '\f' | '-' | '*' )

		      ;