[antlr-interest] same char but different context

Sat Nov 28 10:38:20 PST 2009

It is hard to tell what the format is from this, but presumably each single-character type introducer is the first non-whitespace character after a newline. If that is the case, you need to take the lexer tokens out of the tokens section and create real lexer rules gated by a predicate on a Boolean switch, which is set to true after a newline has been seen and set to false after the single character has been seen. You also don't want an ANY rule; you want a rule that consumes to the end of the line. So you want something like this:

grammar T;

 options {
         output = AST;
}

@lexer::members {
   boolean isType = true;
}

 start       :   header record+  EOF;

 header      :   KEYWORD_TYPE RECORD ;

 record      :   item+ END_OF_RECORD;

 item        :   item_type RECORD ;

 item_type   :   (TYPE_DATE
                 |TYPE_AMOUNT
                 |TYPE_MEMO
                 |TYPE_CLEARED
                 |TYPE_CHECK_NUMBER
                 |TYPE_PAYEE
                 |TYPE_PAYEE_ADDRESS
                 |TYPE_CATEGORY
                 |TYPE_REIMBURSE
                 |TYPE_SPLIT_CATEGORY
                 |TYPE_SPLIT_MEMO
                 |TYPE_SPLIT_AMOUNT
                 |TYPE_SPLIT_PERCENTAGE
                 |TYPE_SECURITY_NAME
                 |TYPE_PRICE
                 |TYPE_SHARE_QUANTITY
                 |TYPE_COMMISSION_COSTS
                 );

KEYWORD_TYPE            : ('!Type:')=>'!Type:'         { isType=false;  };
END_OF_RECORD           : '^';
TYPE_DATE               : {isType}?=>  'D' { isType=false; };
TYPE_AMOUNT             : {isType}?=>  'T' { isType=false; };
TYPE_MEMO               : {isType}?=>  'M' { isType=false; };
TYPE_CLEARED            : {isType}?=>  'C' { isType=false; };
TYPE_CHECK_NUMBER       : {isType}?=>  'N' { isType=false; };
TYPE_PAYEE              : {isType}?=>  'P' { isType=false; };
TYPE_PAYEE_ADDRESS      : {isType}?=>  'A' { isType=false; };
TYPE_CATEGORY           : {isType}?=>  'L' { isType=false; };
TYPE_REIMBURSE          : {isType}?=>  'F' { isType=false; };
TYPE_SPLIT_CATEGORY     : {isType}?=>  'S' { isType=false; };
TYPE_SPLIT_MEMO         : {isType}?=>  'E' { isType=false; };
TYPE_SPLIT_AMOUNT       : {isType}?=>  '$' { isType=false; };
TYPE_SPLIT_PERCENTAGE   : {isType}?=>  '%' { isType=false; };
TYPE_SECURITY_NAME      : {isType}?=>  'Y' { isType=false; };
TYPE_PRICE              : {isType}?=>  'I' { isType=false; };
TYPE_SHARE_QUANTITY     : {isType}?=>  'Q' { isType=false; };
TYPE_COMMISSION_COSTS   : {isType}?=>  'O' { isType=false; };

fragment NLCHARS        : '\r'|'\n';
NEWLINE                 : ('\r'? '\n')+ { isType=true; $channel=99; };
RECORD                  : {!isType}?=>(~NLCHARS)+ ;

This only works if NEWLINE is the end of one record, signifying the start of another. To be honest, the format is so simple that a small program that scans it and builds everything in one pass may be simpler and better for you; it looks like the record format was designed for a simple scanner. Note that your example uses the command 'H', which is not in your command set, and that I have assumed your end-of-record marker is on a line of its own (if not, the RECORD token also needs to exclude '^' from its set). Also note that this is just my best guess, inferred from the grammar you posted.
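
For illustration, a hand-written scanner of the kind mentioned above might look roughly like the following. This is only a minimal sketch under the same assumptions as the grammar: the type character is the first character of each line, '^' sits on a line of its own, and the header line starts with '!Type:'. The names QifScanner, Item, QifFile and scan are hypothetical and are not part of ANTLR or of the posted grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: a hand-written scanner for the
// record-per-line format discussed in this thread.
final class QifScanner {

    // One input line: the single-character type code and the rest of the line.
    static final class Item {
        final char type;
        final String value;
        Item(char type, String value) { this.type = type; this.value = value; }
    }

    // Result of a scan: the text after "!Type:" plus the records that follow.
    static final class QifFile {
        String header = "";
        final List<List<Item>> records = new ArrayList<>();
    }

    static QifFile scan(String input) {
        QifFile file = new QifFile();
        List<Item> current = new ArrayList<>();
        // Accept \r\n, \r or \n as a line end.
        for (String line : input.split("\r\n|\r|\n")) {
            if (line.isEmpty()) continue;              // skip blank lines
            if (line.startsWith("!Type:")) {
                file.header = line.substring("!Type:".length());
            } else if (line.charAt(0) == '^') {
                file.records.add(current);             // '^' closes the record
                current = new ArrayList<>();
            } else {
                // Only the first character is the type code; the rest is data,
                // so a 'D' or 'T' inside the data is never read as a type.
                current.add(new Item(line.charAt(0), line.substring(1)));
            }
        }
        // Assumption: a trailing record without '^' is kept, not rejected.
        if (!current.isEmpty()) file.records.add(current);
        return file;
    }
}
```

Because the position in the line decides what a character means, no predicates or lexer state are needed here. Unlike the grammar, this sketch does not check the type character against the command set, so a line such as 'WHello World' yields type 'W' with the value 'Hello World'; a real program would probably want to validate the type codes.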

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of codeman at bytefusion.de
> Sent: Saturday, November 28, 2009 1:08 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] same char but different context
> 
> Given is a record-per-line format like this:
> 
> <single-char><sequence-of-chars><crlf>
> 
> <single-char> => single letter
> <sequence-of-chars> => any except end-of-line
> <crlf> => end of line
> 
> My problem is the following:
> 
> WHello World
> 
> "W" => recognized as single char
> "Hello " is broken, W seems to be a new start char
> 
> Here is my grammar. The goal is to parse a Quicken Interchange Format
> (QIF) file. Any ideas?
> 
> 
> grammar myExample;
> 
> options {
>         output=AST;
> }
> 
> tokens {
> TYPE_DATE               =   'D';
> TYPE_AMOUNT             =   'T';
> TYPE_MEMO               =   'M';
> TYPE_CLEARED            =   'C';
> TYPE_CHECK_NUMBER       =   'N';
> TYPE_PAYEE              =   'P';
> TYPE_PAYEE_ADDRESS      =   'A';
> TYPE_CATEGORY           =   'L';
> TYPE_REIMBURSE          =   'F';
> TYPE_SPLIT_CATEGORY     =   'S';
> TYPE_SPLIT_MEMO         =   'E';
> TYPE_SPLIT_AMOUNT       =   '$';
> TYPE_SPLIT_PERCENTAGE   =   '%';
> TYPE_SECURITY_NAME      =   'Y';
> TYPE_PRICE              =   'I';
> TYPE_SHARE_QUANTITY     =   'Q';
> TYPE_COMMISSION_COSTS   =   'O';
> 
> }
> 
> start       :   header record+ NEWLINE* EOF;
> 
> header      :   KEYWORD_TYPE description NEWLINE;
> 
> description :   ANY+;
> 
> record      :   item+ END_OF_RECORD;
> 
> item        :   item_type description NEWLINE;
> 
> item_type   :   (TYPE_DATE
>                 |TYPE_AMOUNT
>                 |TYPE_MEMO
>                 |TYPE_CLEARED
>                 |TYPE_CHECK_NUMBER
>                 |TYPE_PAYEE
>                 |TYPE_PAYEE_ADDRESS
>                 |TYPE_CATEGORY
>                 |TYPE_REIMBURSE
>                 |TYPE_SPLIT_CATEGORY
>                 |TYPE_SPLIT_MEMO
>                 |TYPE_SPLIT_AMOUNT
>                 |TYPE_SPLIT_PERCENTAGE
>                 |TYPE_SECURITY_NAME
>                 |TYPE_PRICE
>                 |TYPE_SHARE_QUANTITY
>                 |TYPE_COMMISSION_COSTS
>                 );
> 
> 
> KEYWORD_TYPE            :   '!Type:';
> NEWLINE                 :   ('\r'|'\n'|'\r\n');
> END_OF_RECORD           :   '^';
> ANY                     :   ~(NEWLINE);
> 