[antlr-interest] Parsing names

Mon Mar 17 21:42:41 PDT 2008

Alan,

Terence has some links on the antlr website, and I think some of the 
examples include building a rudimentary symbol table. But for purposes 
of this language, you don't need very much. You DO need to smarten up 
the lexer, because spaces are allowed in the names.

First, parse the "Seat 7: Pretty Pam 10 (3,355)" line.

That looks to me like a simple regular expression. Everything after ": " 
(colon, space) up to " (" (space, paren) is a name.

seat_spec: 'Seat' Num ':' PlayerNameDef '(' CommaNum ')' ;

Writing the PlayerNameDef lexer target will be a little bit challenging 
because you have to look ahead for the terminator. An easier approach 
might be to make a single token, called "SEAT_INFO" or something, that 
gobbles up the entire line. Then you could parse the player name out by 
hand.
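
As a rough illustration of parsing the name out by hand, here is a 
minimal Java sketch. It assumes the whole seat line has already been 
captured as one string. The class name SeatLine and its fields are made 
up for this example; they are not part of ANTLR or of your grammar. The 
greedy (.+) runs to the last " (" on the line, so spaces inside the name 
are kept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits "Seat <n>: <name> (<chips>)" into its parts.
final class SeatLine {
    // The name is everything between ": " and the final " (".
    private static final Pattern SEAT =
        Pattern.compile("^Seat (\\d+): (.+) \\(([\\d,]+)\\)$");

    final int seat;
    final String name;
    final int chips;

    private SeatLine(int seat, String name, int chips) {
        this.seat = seat;
        this.name = name;
        this.chips = chips;
    }

    // Returns the parsed line, or null if the text is not a seat line.
    static SeatLine parse(String line) {
        Matcher m = SEAT.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        int chips = Integer.parseInt(m.group(3).replace(",", ""));
        return new SeatLine(Integer.parseInt(m.group(1)), m.group(2), chips);
    }

    public static void main(String[] args) {
        SeatLine s = SeatLine.parse("Seat 7: Pretty Pam 10 (3,355)");
        System.out.println(s.seat + " | " + s.name + " | " + s.chips);
        // prints: 7 | Pretty Pam 10 | 3355
    }
}
```

The summary lines at the end of the hand ("Seat 1: miannie folded on the 
Turn") do not end in a parenthesized chip count, so this pattern returns 
null for them.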

After all the Seat 1..n lines, all player names appear to start at the 
beginning of the line. Create a second lexer target "PlayerName" that 
matches. The trick is to compare with an array or tree of player names 
you captured in memory from the seat_spec lines.

Create an array of strings (or objects -- make this more complex when 
you're comfortable), player_names[] = { "miannie", "LATUK", "stigs2", 
"PeggySue07", ... "Pretty Pam 10", ... };

Then as you are lexing the PlayerName, defer adding a character until 
you are sure it matches one or more possible names in the list. 
Otherwise, either report an error or end the token. (A space ends the 
token, a non-space probably indicates an error.) You might do this at 
the parser level, too--grab a bunch of "player words" and append them. 
But you'll have to append them character-by-character, so that doesn't 
buy you much.
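
Here is a minimal sketch of the line-level variant, assuming the names 
captured from the seat_spec lines are kept in a plain list. PlayerNames 
and leadingName are hypothetical names for illustration. It picks the 
longest known name at the start of a line, and only when the name is 
followed by a space, a colon (chat lines such as "johnvfardella: Hi 
Pam"), or the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory "symbol table": just the names seen on seat lines.
final class PlayerNames {
    private final List<String> names = new ArrayList<>();

    void add(String name) {
        names.add(name);
    }

    // Longest known name that starts the line and ends at a boundary
    // (space, colon, or end of line); null if no known name fits.
    String leadingName(String line) {
        String best = null;
        for (String name : names) {
            if (!line.startsWith(name)) {
                continue;
            }
            boolean atBoundary = line.length() == name.length()
                || line.charAt(name.length()) == ' '
                || line.charAt(name.length()) == ':';
            if (atBoundary && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PlayerNames players = new PlayerNames();
        players.add("johnvfardella");
        players.add("Pretty Pam 10");
        System.out.println(players.leadingName("Pretty Pam 10 calls 20"));
        System.out.println(players.leadingName("johnvfardella: Hi Pam"));
        System.out.println(players.leadingName("*** FLOP *** [Kc 5d 2s]"));
    }
}
```

A linear scan is plenty for a ten-seat table. A tree only starts to pay 
off if the name list gets large.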

Slightly off topic, here's a hinky lexer rule I built that knows more 
than it should about the innards of how the lexer works. The 
"input.LA(1)" stuff is me looking ahead at the next character. The 
"matchAny" routine is a lexer built-in that does what it sounds like. 
The other mFoo() calls are various patterns I was recursing into, 
specific to my grammar. You can see that I'm basically looping, gobbling 
one or more characters at a time.

==========

fragment
NestedCodeBlock
    : '{'
    {
	loopNCB:
	do
	{
	    int next_char = input.LA(1);

	    switch (next_char)
	    {
	    case '}':
		break loopNCB;

	    case '"': mQlit_Double(); break;
	    case '\'': mQlit_Single(); break;
	    case '\u00AB': mQlit_Willies(); break;
	    case '<':
		if (input.LA(2) == '<')
		    mQlit_Angles();
		else
		    matchAny();
		break;

	    case '/':
		switch (input.LA(2))
		{
		case '/': mSingleLineComment(); break;
		case '*': mMultiLineComment(); break;
		default: matchAny(); break;
		}
		break;

	    case '{': mNestedCodeBlock(); break;
	    default: matchAny(); break;
	    }

	    if (failed) return;
	}
	while (true);
	match('}');
	if (!failed) return;
    }
    .* '}'
    ;

==========

Note that the last line but one, ".* '}'" is there to confuse ANTLR. If 
you don't confuse it, it knows too much and screws up my code. (Some of 
the "if failed return" stuff is there for the same reason. Freaking Java 
won't let you keep "dead code" in your methods.)

You can probably get away with a loop that looks like

top-of-loop:
get-next-char
append-next-char-to-buffer
for (all names in seat list)
do
    if (buffer equals name.substring(0, buffer.length))
        accept this character, go to top of loop
done

(no name starts with buffer, so take the last character back off)
if (character is space and buffer equals a whole name)
    accept buffer, keep space as next input
else
    reject this name-plus-extra-letter as bogus ("Pretty Pam 10[ ]" is
    okay. "Pretty Pam 10[0]" is not okay.)
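
In plain Java, outside of any ANTLR lexer, that loop might look like the 
sketch below. NameScanner, scan and isPrefixOfAny are names invented for 
this example. One thing it adds beyond the outline above: it remembers 
the last point where the buffer was a whole name followed by a space, a 
colon, or end of line, so that a short name ("alan") still matches when 
a longer one ("alan 10") shares its prefix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical character-at-a-time matcher for known player names.
final class NameScanner {
    // True if at least one known name begins with the given text.
    static boolean isPrefixOfAny(String text, List<String> names) {
        for (String name : names) {
            if (name.startsWith(text)) {
                return true;
            }
        }
        return false;
    }

    // Returns the known name at the start of the line, or null if none.
    static String scan(String line, List<String> names) {
        StringBuilder buffer = new StringBuilder();
        String lastComplete = null;
        for (int pos = 0; pos < line.length(); pos++) {
            buffer.append(line.charAt(pos));
            String soFar = buffer.toString();
            if (!isPrefixOfAny(soFar, names)) {
                break; // this character cannot extend any known name
            }
            boolean atBoundary = pos + 1 == line.length()
                || line.charAt(pos + 1) == ' '
                || line.charAt(pos + 1) == ':';
            if (atBoundary && names.contains(soFar)) {
                lastComplete = soFar; // whole name, cleanly terminated
            }
        }
        return lastComplete;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("miannie", "Pretty Pam 10");
        System.out.println(scan("Pretty Pam 10 calls 20", names));  // Pretty Pam 10
        System.out.println(scan("Pretty Pam 100 calls 20", names)); // null
    }
}
```

Inside a real lexer rule you would read characters with input.LA(1) 
rather than from a String, but the bookkeeping is the same.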

=Austin

alan brown wrote:
> Happy to provide example text...
>
> The following is a hand being played with "Pretty Pam 10".  My lexer 
> creates tokens that are mostly words and numbers and my parser is 
> being forced to do major look ahead to work out what each line is 
> trying to convey (because things like 'PeggySue has 15 seconds to 
> act'  or 'johnvfardella is feeling happy' (among others) can appear 
> almost anywhere).
>
> How would I implement a symbol table dynamically.  Can you point me to 
> an example or some documentation?  I don't see it in the book.  I 
> haven't created a symbol table before.
>
> In the example below I'd like to make the 10 players names first class 
> citizens (ie single tokens).
>
> alan
>
> Game #5678328259: Table Play Chip 798 - 10/20 - Limit Hold'em - 22:09:21 ET - 2008/03/17
> Seat 1: miannie (949)
> Seat 2: LATUK (320)
> Seat 3: stigs2 (1,110)
> Seat 4: PeggySue07 (1,080)
> Seat 5: tishlidji (300)
> Seat 6: brownalan (200)
> Seat 7: Pretty Pam 10 (3,355)
> Seat 8: larrydj (31,142)
> Seat 9: johnvfardella (200)
> stigs2 posts the small blind of 5
> PeggySue07 posts the big blind of 10
> brownalan posts 10
> johnvfardella posts 10
> larrydj posts a dead small blind of 5
> larrydj posts 10
> The button is in seat #2
> *** HOLE CARDS ***
> Dealt to brownalan [Qh Qd]
> tishlidji calls 10
> brownalan raises to 20
> Pretty Pam 10 calls 20
> larrydj calls 10
> johnvfardella calls 10
> miannie has 8 seconds left to act
> johnvfardella is feeling happy
> miannie calls 20
> LATUK calls 20
> stigs2 calls 15
> PeggySue07 calls 10
> tishlidji calls 10
> *** FLOP *** [Kc 5d 2s]
> stigs2 has 8 seconds left to act
> stigs2 checks
> PeggySue07 checks
> tishlidji checks
> brownalan checks
> Pretty Pam 10 checks
> larrydj checks
> johnvfardella: Hi Pam
> johnvfardella checks
> miannie checks
> LATUK checks
> *** TURN *** [Kc 5d 2s] [2c]
> stigs2 has 8 seconds left to act
> stigs2 bets 20
> PeggySue07 calls 20
> tishlidji calls 20
> johnvfardella: looking good
> brownalan raises to 40
> Pretty Pam 10 folds
> larrydj calls 40
> johnvfardella calls 40
> miannie folds
> LATUK folds
> stigs2 raises to 60
> PeggySue07 calls 40
> tishlidji calls 40
> brownalan calls 20
> larrydj calls 20
> johnvfardella calls 20
> *** RIVER *** [Kc 5d 2s 2c] [7s]
> Pretty Pam 10: hello
> stigs2 has 8 seconds left to act
> stigs2 bets 20
> PeggySue07 calls 20
> tishlidji raises to 40
> brownalan has 8 seconds left to act
> brownalan folds
> larrydj folds
> johnvfardella calls 40
> stigs2 raises to 60
> PeggySue07 folds
> tishlidji raises to 80
> johnvfardella folds
> stigs2 calls 20
> *** SHOW DOWN ***
> tishlidji shows [7h 2h] a full house, Twos full of Sevens
> stigs2 shows [Ks 2d] a full house, Twos full of Kings
> stigs2 wins the pot (765) with a full house, Twos full of Kings
> *** SUMMARY ***
> Total pot 765 | Rake 0
> Board: [Kc 5d 2s 2c 7s]
> Seat 1: miannie folded on the Turn
> Seat 2: LATUK (button) folded on the Turn
> Seat 3: stigs2 (small blind) showed [Ks 2d] and won (765) with a full house, Twos full of Kings
> Seat 4: PeggySue07 (big blind) folded on the River
> Seat 5: tishlidji showed [7h 2h] and lost with a full house, Twos full of Sevens
> Seat 6: brownalan folded on the River
> Seat 7: Pretty Pam 10 folded on the Turn
> Seat 8: larrydj folded on the River
> Seat 9: johnvfardella folded on the River
>
> On Mon, Mar 17, 2008 at 8:54 PM, Austin Hastings 
> <Austin_Hastings at yahoo.com> wrote:
>
>     Alan,
>
>     1. How about giving us some example text?
>
>     2. Create a symbol table. This is a "higher level" solution, but
>     probably right.
>
>     3. It may be that your text is more amenable to parsing with a "lower
>     level" approach. Possibly more use of regular expressions is needed.
>
>     Before you make any permanent decisions, see #1.
>
>     =Austin
>
>      alan brown wrote:
>     > I'm having a problem with my lexer/parser design.  I'm trying to parse
>     > a poker hand history file and extract the relevant information.  I got
>     > a working solution but it's quite brittle.  My problem is that my
>     > lexer is creating tokens of the words and my parser is reading those
>     > words to work out player names and bets and so on.  My issue is that
>     > the player names (among other things) is causing me grief.  A player
>     > name can be "alan 10 folds" which as you might imagine, can cause some
>     > confusion.
>     >
>     > What I'd like to do is to parse the file to create the tokens that the
>     > lexer would look for so when my parser runs over the tokens all the
>     > names are single tokens.
>     >
>     > How do I dynamically define the tokens for the lexer to parse?
>     >
>     > alan
>
>