[antlr-interest] Lexer rule for INTEGER and COMMA_INTEGER

Tue Nov 6 15:23:38 PST 2012

A solution for v4.

It took roughly 2 hours using v4, versus 2 days using v3.4. As you can see by
comparing with the v3.4 solution, ANTLR 4 is much more powerful, writing a
grammar is simpler, and the trace is more user-friendly:

enter   comma_integer, LT(1)=1
consume [@59,80:80='1',<7>,1:80] rule comma_integer alt=1
exit    comma_integer, LT(1)= ,

A big quantum leap, a five-star tool, if not All*.

========== grammar

grammar Q4;

/* Recognize comma-grouped numbers like 1,234,567 as a whole, but
   F(1, 2 ,3, 44,55,66) as 4 parameters. White space is skipped,
   except that `, ` and ` ,` are separators.
   For ANTLR v4. */

@parser::members {
    ArrayList<String> parms;
    void storeAtom(String text) {
        parms.add(text);
//        System.out.println("atom <" + text + "> has been added");
    }
}

line
@init {System.out.println("--- last update 1426");}
    : piece* EOF ;

piece
    :   comma_integer  {System.out.println("===== found a COMMA_INTEGER : <" + $comma_integer.text + ">");}
    |   function
    ;

comma_integer
    :   INT ( COMMA INT )*
    ;

function
@init {parms = new ArrayList<String>();}
@after {
    System.out.println(">>>>> Function " + $function.text + " has " + parms.size() + " parameters");
    for (int i = 0; i < parms.size(); i++)
        System.out.println("p" + (i + 1) + "=`" + parms.get(i) + "`");
}

    :   ID '(' list ')'
    ;

list
    :   a=atom  {storeAtom($a.text);}
        ( seperator b=atom  {/* storeAtom($seperator.text); */ storeAtom($b.text);}
        )*
    ;

seperator
    :   COMMA
    |   COMMA_SPACE
    |   SPACE_COMMA
    ;

atom
    :   ID
    |   comma_integer
    |   INT
    ;

COMMA_SPACE : ', ' ;
SPACE_COMMA : ' ,' ;
COMMA : ',' ;
ID  : [a-zA-Z_]+ ;
INT : DIGIT+ ;
WS  : [ \t\r\n] -> channel(HIDDEN) ;

fragment DIGIT : [0-9];

========== input

$ cat t2.comma
1,234,567 F(1, x)  G(11,   12  , 13,444)  H(99,88,77,  66,6)  P(9, 8,77,666)  X(1 , 2, 3 ,4 , 5,6     ,   7,888,999)

========== execution

$ alias
alias antlr4='java -jar /usr/local/lib/antlr-4.0b2-complete.jar'
$ antlr4 Q4.g4
$ javac Q4*.java
$ grun Q4 line -tokens t2.comma
[@0,0:0='1',<7>,1:0]
[@1,1:1=',',<5>,1:1]
[@2,2:4='234',<7>,1:2]
...
--- last update 1426
===== found a COMMA_INTEGER : <1,234,567>
>>>>> Function F(1, x) has 2 parameters
p1=`1`
p2=`x`
>>>>> Function G(11,   12  , 13,444) has 3 parameters
p1=`11`
p2=`12`
p3=`13,444`
>>>>> Function H(99,88,77,  66,6) has 2 parameters
p1=`99,88,77`
p2=`66,6`
>>>>> Function P(9, 8,77,666) has 2 parameters
p1=`9`
p2=`8,77,666`
>>>>> Function X(1 , 2, 3 ,4 , 5,6     ,   7,888,999) has 6 parameters
p1=`1`
p2=`2`
p3=`3`
p4=`4`
p5=`5,6`
p6=`7,888,999`

2012/11/3 Zhaohui Yang <yezonghui at gmail.com>

> Hi,
>
> I have a lexer grammar that has to recognize INTEGER like 1234 and
> COMMA_INTEGER like 1,234,567
> The latter integer token has a comma in it, and of course the language has
> other places that use comma, e.g. F(1, x) is valid, which contains "1,"
> that should be recognized as an INTEGER 1 followed by a comma.
> ...........
>
Yes. If there is white space before or after the comma, they are separate
parameters; if there is no white space around it, it is one COMMA_INTEGER.
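
To make that rule concrete outside of ANTLR, here is a rough plain-Java
sketch. It is not part of the Q4 grammar: the class CommaSplitSketch and the
helper splitParams are hypothetical names, and a regular expression stands in
for the COMMA_SPACE / SPACE_COMMA tokens. It only models the white space rule
for the text between the parentheses, so unlike the grammar it would not
split two identifiers written as `a,b`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommaSplitSketch {

    // Hypothetical helper: split the text between '(' and ')' into parameters.
    // A comma is treated as a separator only when white space touches it on
    // at least one side; a comma with no white space around it stays inside
    // the parameter, as in 13,444.
    static List<String> splitParams(String args) {
        // First alternative: white space before the comma (optionally after).
        // Second alternative: white space only after the comma.
        String[] parts = args.trim().split("\\s+,\\s*|,\\s+");
        return new ArrayList<>(Arrays.asList(parts));
    }

    public static void main(String[] argv) {
        String[] samples = {
            "11,   12  , 13,444",
            "99,88,77,  66,6",
            "1 , 2, 3 ,4 , 5,6     ,   7,888,999"
        };
        for (String s : samples) {
            List<String> parms = splitParams(s);
            System.out.println("(" + s + ") has " + parms.size() + " parameters");
            for (int i = 0; i < parms.size(); i++) {
                System.out.println("p" + (i + 1) + "=`" + parms.get(i) + "`");
            }
        }
    }
}
```

The three sample strings are the argument lists of G, H and X from t2.comma,
and the sketch splits them into 3, 2 and 6 parameters, the same counts the
parser reports in the execution log.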

> --
> Regards,
>
> Yang, Zhaohui