[antlr-interest] Lexer error
Mark Wright
markwright at internode.on.net
Wed Apr 14 03:08:45 PDT 2010
On Wed, Apr 14, 2010 at 04:48:51PM +0800, Brian Catlin wrote:
> Placing the fragment attribute on FILE_NAME was just the last in a long
> series of desperate attempts to try and get it to work. I, too, am surprised
> that ANTLR didn't at least warn about it.
>
>
>
> Thanks for the advice about memoization and backtracking.
>
>
>
> I modified FILE_NAME to add the quotes, as you suggested, but that didn't
> help:
>
>
>
> FILE_NAME
>
> : '"' ~('|' | '<' | '>' | '*' | '?' | '\r' | '\n' | '"')+ '"';
>
>
>
> Do you have any recommendations on examples that use semantic predicates in
> a way that is similar to what I'm trying to do?
Yes, see p. 287, the section "Keywords as Variables", in The Definitive ANTLR Reference.
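
To make the general idea concrete without reproducing the book's example: the point is to let context, not the raw text alone, decide whether a run of characters is a keyword or a file name. Below is a minimal sketch of that idea in plain C, entirely outside ANTLR. The names Tok, next_tok and tokenize_line are made up for illustration, and the only rule it encodes is an assumption taken from your description: a file name follows the FILE keyword and runs to the end of the line.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; these are not ANTLR types. */
typedef enum { TOK_KEYWORD, TOK_FILE_NAME, TOK_END } TokKind;

typedef struct {
    TokKind kind;
    char    text[260];
} Tok;

/* Read one token starting at p and return the position after it.
   The caller passes the context: when expect_file_name is nonzero the
   rest of the line is taken verbatim as a file name, otherwise a single
   blank-delimited word is read and upper-cased. */
static const char *next_tok(const char *p, int expect_file_name, Tok *out)
{
    size_t n = 0;

    while (*p == ' ' || *p == '\t')
        p++;

    if (*p == '\0' || *p == '\r' || *p == '\n') {
        out->kind = TOK_END;
        out->text[0] = '\0';
        return p;
    }

    if (expect_file_name) {
        /* File name: everything up to the end of the line, blanks included. */
        while (*p != '\0' && *p != '\r' && *p != '\n' && n < sizeof out->text - 1)
            out->text[n++] = *p++;
        /* Drop trailing blanks. */
        while (n > 0 && (out->text[n - 1] == ' ' || out->text[n - 1] == '\t'))
            n--;
        out->kind = TOK_FILE_NAME;
    } else {
        /* Word: up to the next whitespace, folded to upper case. */
        while (*p != '\0' && !isspace((unsigned char)*p) && n < sizeof out->text - 1) {
            out->text[n++] = (char)toupper((unsigned char)*p);
            p++;
        }
        out->kind = TOK_KEYWORD;
    }

    out->text[n] = '\0';
    return p;
}

/* Split one command line into at most max_toks tokens and return the count.
   The "predicate" is the expect_file_name flag: it is switched on only
   right after the FILE keyword has been seen. */
int tokenize_line(const char *line, Tok *toks, int max_toks)
{
    int count = 0;
    int expect_file_name = 0;
    const char *p = line;

    while (count < max_toks) {
        p = next_tok(p, expect_file_name, &toks[count]);
        if (toks[count].kind == TOK_END)
            break;
        expect_file_name = toks[count].kind == TOK_KEYWORD &&
                           strcmp(toks[count].text, "FILE") == 0;
        count++;
    }
    return count;
}
```

With this, a line such as "show file dump" gives two keyword tokens followed by a file name token whose text is "dump", so a file that happens to share its name with a keyword is not mistaken for one. In an ANTLR grammar the same switch would be expressed as a semantic predicate rather than a hand-written flag.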
Regards, Mark
> Thanks!
>
> -Brian
>
>
>
> From: Cliff Hudson [mailto:cliff.s.hudson at gmail.com]
> Sent: Wednesday, April 14, 2010 16:19
> To: BrianC at sannas.org
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer error
>
>
>
> FILE_NAME is a fragment, so it will never match as a token without another
> token referring to it. Rule a_file thus can never match (and in fact it
> seems like you should get an error about that).
>
>
>
> You will have a more general problem that FILE_NAME can also match any of
> your keywords, and likewise your keywords can match any filename that has
> the same text, which means certain filenames will not produce the expected
> tokens in your grammar. Tokens without wildcards match in the order they
> are declared, but tokens with wildcards can consume input before preceding
> tokens that don't have wildcards which could also match the same input.
>
>
>
> There are a couple of ways around this:
>
> 1. Teach your lexer more about the input using semantic predicates - these
> allow you to switch token rules on and off depending on conditions you set.
>
> 2. Ensure your tokens are lexically unambiguous - for instance FILE_NAME
> could be surrounded by quotation marks which none of your other tokens use.
> This option is probably more desirable, since file names can also contain
> whitespace, and depending on how your grammar turns out, this would allow
> you to continue to match tokens after the file name.
>
> One note - ANTLR does not perform case-insensitive tokenization. You've
> probably already come across this, but I just wanted to make sure you knew
> before you hit that too.
>
>
>
> Finally, be sure to turn off backtracking and memoization periodically to
> see if your grammar will function without them. These do incur
> performance/memory penalties, and most grammars can be written without
> invoking these features.
>
>
>
> On Wed, Apr 14, 2010 at 12:57 AM, Brian Catlin <BrianC at sannas.org> wrote:
>
> The following grammar compiles without any sort of warnings or errors, and
> ANTLRworks doesn't complain either, but when I call the parser, it returns a
> warning for each character in the string to be parsed. I know it has
> something to do with the FILE_NAME rule, but I don't know how to fix it. I
> suspect that the lexer cannot create a token because the FILE_NAME rule
> could also match any other token (a file name on Windows can contain just
> about any character). I've structured my grammar so that the FILE_NAME is
> always the last token on a line, so I figured ANTLR would be able to figure
> it out from that context, but that doesn't appear to be the case. So, how
> can I describe it to ANTLR?
>
>
>
> Any help would be greatly appreciated!
>
>
>
> -Brian
>
>
>
>
>
> DT> dump mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 0, near 'D' :
>
> dump mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 1, near 'U' :
>
> ump mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 2, near 'M' :
>
> mp mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 3, near 'P' :
>
> p mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 5, near 'M' :
>
> mbr
>
> -memory-(1) : lexer error 3 :
>
> at offset 6, near 'B' :
>
> br
>
> -memory-(1) : lexer error 3 :
>
> at offset 7, near 'R' :
>
> r
>
>
>
> //
>
> // This grammar defines the commands available to the DiskTool (DT) program
>
> //
>
>
>
> grammar Commands;
>
>
>
> options
>
> {
>
> output = AST;
>
> ASTLabelType = pANTLR3_BASE_TREE;
>
> language = C;
>
> backtrack = true;
>
> memoize = true;
>
> }
>
>
>
> @lexer::header
>
> {
>
> #define ANTLR3_INLINE_INPUT_ASCII
>
> }
>
>
>
> //+
>
> // Productions
>
> //-
>
>
>
> commands
>
> :
>
> (script_command
>
> | dump_command
>
> | show_command
>
> )*;
>
>
>
> script_command
>
> : '@'
>
> FILE_NAME
>
> ;
>
>
>
> dump_command
>
> : DUMP
>
> ( dump_struct
>
> | dump_block
>
> | a_file
>
> );
>
>
>
> show_command
>
> : SHOW
>
> ( structure_nouns
>
> | storage_nouns
>
> | a_file
>
> );
>
>
>
> mbr_vbr
>
> : MBR
>
> | VBR
>
> ;
>
>
>
> block_nouns
>
> : LBN
>
> | LCN
>
> | VBN
>
> | VCN
>
> ;
>
>
>
> structure_nouns
>
> : MBR
>
> | VBR
>
> ;
>
>
>
> dump_block
>
>
>
> : block_nouns
>
> number
>
> (
>
> (',' number
>
> )
>
> |
>
> (':' number
>
> ))?
>
> DRIVE_NAME?
>
> ;
>
>
>
> dump_struct
>
> : mbr_vbr
>
> ('/' qualifier)?
>
> DRIVE_NAME?
>
> ;
>
>
>
> storage_nouns
>
> : DISK
>
> | VOLUME
>
> ;
>
>
>
> a_file
>
> : FILE
>
> FILE_NAME
>
> ;
>
>
>
> number
>
> : DEC_NUMBER
>
> | HEX_NUMBER
>
> ;
>
>
>
> qualifier
>
> : ALL
>
> | CODE
>
> | TABLE
>
> ;
>
>
>
> //+
>
> // Tokens
>
> //-
>
>
>
> // Verbs
>
>
>
> DUMP : 'DUMP';
>
> SHOW : 'SHOW';
>
>
>
> // Nouns
>
>
>
> DISK : 'DISK';
>
> FILE : 'FILE';
>
> LBN : 'LBN';
>
> LCN : 'LCN';
>
> MBR : 'MBR';
>
> PBN : 'PBN';
>
> VBN : 'VBN';
>
> VBR : 'VBR';
>
> VCN : 'VCN';
>
> VOLUME : 'VOLUME';
>
>
>
> // Qualifiers
>
>
>
> ALL : 'ALL';
>
> CODE : 'CODE';
>
> TABLE : 'TABLE';
>
>
>
> // Miscellaneous tokens
>
>
>
> DRIVE_NAME
>
> : LETTER ':';
>
>
>
> fragment
>
> LETTER : 'A'..'Z';
>
>
>
> fragment
>
> DIGIT : '0'..'9';
>
>
>
> fragment
>
> HEX_DIGIT : (DIGIT | 'A'..'F');
>
>
>
> HEX_NUMBER : '0X' HEX_DIGIT+;
>
>
>
> DEC_NUMBER : DIGIT+;
>
>
>
> fragment
>
> FILE_NAME
>
> : ~('|' | '<' | '>' | '*' | '?' | '\r' | '\n')+ (('\r'? '\n') | EOF);
>
>
>
> LINE_COMMENT
>
> : '!' ~('\n'|'\r')* (('\r'? '\n') | EOF) {$channel=HIDDEN;};
>
>
>
> WS : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
>
>
>
>
>
>
>
> #include <windows.h>
>
> #include <stdio.h>
>
> #include <string.h> // For strlen
>
>
>
> #include "CommandsLexer.h" // Generated by ANTLR from Commands.g
>
> #include "CommandsParser.h" // Generated by ANTLR from Commands.g
>
>
>
>
>
>
>
> int main (int Argc, char* Argv[])
>
> {
>
> DWORD status;
>
> char* ptr;
>
> char command [1024];
>
> DWORD command_len;
>
> pANTLR3_INPUT_STREAM input;
>
> pANTLR3_COMMON_TOKEN_STREAM tstream;
>
> pCommandsLexer lexer;
>
> pCommandsParser parser;
>
> CommandsParser_commands_return commands_ast;
>
> pANTLR3_COMMON_TREE_NODE_STREAM nodes;
>
> //pCommandsDumpDecl tree_parser;
>
>
>
>
>
> //+
>
> // Display our prompt and read a command string from the console
>
> //-
>
>
>
> while (TRUE)
>
> {
>
> printf ("DT> ");
>
>
>
> //+
>
> // Read the entire line
>
> //-
>
>
>
> if ((ptr = gets_s ((char *)command, sizeof (command))) != NULL)
>
> {
>
> command_len = strlen ((char*)command);
>
>
>
> //+
>
> // Only try to parse the input if there is something there
>
> //-
>
>
>
> if (command_len > 0)
>
> {
>
>
>
> //+
>
> // Create the input stream
>
> //-
>
>
>
> if ((input = antlr3NewAsciiStringInPlaceStream ((pANTLR3_UINT8)&command, (ANTLR3_UINT64) command_len, NULL)) != 0)
>
> {
>
>
>
> //+
>
> // Tell ANTLR to use upper-case when matching tokens
>
> //-
>
>
>
> input->setUcaseLA (input, ANTLR3_TRUE);
>
>
>
> //+
>
> // Create a new instance of the lexer using our input stream
>
> //-
>
>
>
> if ((lexer = CommandsLexerNew (input)) != 0)
>
> {
>
>
>
> //+
>
> // Create the token stream
>
> //-
>
>
>
> if ((tstream = antlr3CommonTokenStreamSourceNew (ANTLR3_SIZE_HINT, TOKENSOURCE(lexer))) != 0)
>
> {
>
>
>
> //+
>
> // Create a new instance of the parser using our lexer
>
> //-
>
>
>
> if ((parser = CommandsParserNew (tstream)) != 0)
>
> {
>
>
>
> //+
>
> // Call the parser with the start symbol
>
> //-
>
>
>
> commands_ast = parser->commands (parser);
>
>
>
> //+
>
> // Check for errors parsing the input
>
> //-
>
>
>
> if (parser->pParser->rec->state->errorCount == 0)
>
> {
>
>
>
> //+
>
> // The input was parsed successfully. Use the Abstract Syntax Tree
>
> // which contains a linked list of nodes containing the tokens that
>
> // were parsed
>
> //-
>
>
>
> nodes = antlr3CommonTreeNodeStreamNewTree (commands_ast.tree, ANTLR3_SIZE_HINT);
>
> printf ("Commands tree: %s\n", commands_ast.tree->toStringTree (commands_ast.tree)->chars);
>
> // tree_parser = CommandsDumpDeclNew (nodes);
>
>
>
> // tree_parser->decl (tree_parser);
>
> // nodes->free (nodes);
>
> // tree_parser->free (tree_parser);
>
> }
>
> else
>
> {
>
> printf ("Errors found during parsing: %d\n", parser->pParser->rec->state->errorCount);
>
> }
>
>
>
> //+
>
> // We're now done with these instances, so free them
>
> //-
>
>
>
> parser->free (parser);
>
> tstream->free (tstream);
>
> lexer->free (lexer);
>
> input->close (input);
>
> }
>
> else
>
> {
>
> status = GetLastError ();
>
> printf ("Error creating parser, status = %08x\n", status);
>
> break;
>
> }
>
>
>
> }
>
> else
>
> {
>
> status = GetLastError ();
>
> printf ("Unable to create token stream, status = %08x\n", status);
>
> break;
>
> }
>
>
>
> }
>
> else
>
> {
>
> status = GetLastError ();
>
> printf ("Unable to create lexer, status = %08x\n", status);
>
> break;
>
> }
>
>
>
> }
>
> else
>
> {
>
> status = GetLastError ();
>
> printf ("Error creating the input stream, status = %08x\n", status);
>
> break;
>
> }
>
>
>
> }
>
>
>
> }
>
>
>
>
>
> } // End while
>
>
>
> }
>
>
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address