[antlr-interest] URGENT : HTML and SCRIPT tag
johnclarke72
johnclarke at hotmail.com
Sun Jun 23 14:35:40 PDT 2002
At the moment I am still working on an HTML parser.
Basically, I need the current tag definitions to keep working, but I
also need the lexer to process a script tag so that it keeps all of the
data (including spaces, tabs, etc.) between the begin and end tags.
For example:
<script>
test code
some lines of code
some more lines of code
</script>
The script tag could also contain attributes like any other tag.
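To make the requirement concrete, here is a minimal sketch in plain Java
(not ANTLR) of the behaviour wanted for the script tag: everything between
the end of the opening tag and the closing tag comes back untouched,
whitespace included. The names ScriptBodySketch, extractScriptBody and
indexOfIgnoreCase are hypothetical and are not part of the grammar. The
sketch also assumes that no attribute value contains '>'.

```java
class ScriptBodySketch {

    // Case-insensitive indexOf, mirroring caseSensitive = false in the lexer.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the raw text between "<script ...>" and "</script>",
    // or null if either tag is missing.
    // Assumption: the first '>' after "<script" ends the opening tag.
    static String extractScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) {
            return null;
        }
        int openEnd = html.indexOf('>', open);
        if (openEnd < 0) {
            return null;
        }
        int close = indexOfIgnoreCase(html, "</script>", openEnd + 1);
        if (close < 0) {
            return null;
        }
        // Spaces, tabs and newlines are preserved exactly as written.
        return html.substring(openEnd + 1, close);
    }

    public static void main(String[] args) {
        String html = "<SCRIPT language=\"javascript\">\n\tif (a < b) { x = 1; }\n</script>";
        System.out.println("[" + extractScriptBody(html) + "]");
    }
}
```

Note that the body can contain '<' and '>' characters, so it cannot be
tokenised with the normal tag rules.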
I have included my grammar for the lexer below. How can I modify
this grammar so that it can handle the script tag?
John
// Import the required Classes
header
{
import java.util.*;
import antlr.*;
}
// Define the class
class HTMLLexer extends Lexer;
// Set the options for the Lexer
options
{
k=9; // Set the lookahead to 9 characters
caseSensitive = false; // Set case sensitivity to false
charVocabulary = '\1' .. '\377'; // Set the lexer character vocabulary
testLiterals = false; // Don't test against the literals table
exportVocab = HTMLLexer; // The Grammar to export
}
// Text Data - This is used for Text, Tags and Attributes
TEXTDATA : (~(' ' | '\r' | '\n' | '\t' | '<' | '>' | '/' | '!' | '='
| '"' | '\''))+;
// HTML Comments
HTMLCOMMENT : "<!--"! (options {greedy=false;} : .)* "-->"!;
// Document Type Definition
HTMLDTD : "<!doctype"! (options {greedy=false;} : .)* ">"!;
//
// Main HTML Tag Section
//
STARTTAG
{
Hashtable tagAttributes = null;
TagToken returnToken = null;
}
: "<"! tagName:TEXTDATA (WS (tagAttributes = ATTRIBUTES)?)?
{
returnToken = new TagToken(tagName.getText(),tagAttributes);
$setToken(returnToken);
}
(">"!);
// Definition of an End Tag
ENDTAG : "</"! TEXTDATA ">"!;
// For processing HTML Attributes
// TAGVALUE matches a quoted attribute value; the closing quote must be
// the same kind as the opening one, so "it's" and 'say "hi"' both work
protected TAGVALUE
: '"'! (~'"')* '"'!
| '\''! (~'\'')* '\''!
;
// Definition for Attributes
protected ATTRIBUTES returns [Hashtable a = new Hashtable()]
: ( ATTRIBUTE[a] (WS ATTRIBUTE[a])* )
;
protected ATTRIBUTE [Hashtable h]
: key:TEXTDATA { h.put(key.getText(), ""); }
( '=' (WS)?
( v1: TEXTDATA { h.put(key.getText(), v1.getText());}
| v2: TAGVALUE { h.put(key.getText(), v2.getText());}
)
)?
;
// Ignore all White Space
WS : ( ' '
| '\t'
| '\r' '\n' { newline(); }
| '\n' { newline(); }
)
{$setType(Token.SKIP);} //ignore this token
;