An in place tokeniser.

hutch-- · June 23, 2006, 01:43:47 AM

Following from a discussion I have had elsewhere I thought I would put together a tokeniser that leaves the data in place, returns an array of pointers and array count and modifies the data pointed at the algo by writing a terminating zero after each token in the data.

The virtue of doing it this way is you don't have to move or copy the data and you deal with the pointer memory allocation by running a string length algo, in this cae, Agner Fog's then you do a single string modification pass that stores the pointer to each token and terminates each token.

It seems to be test up OK for a prototype algorithm but to get meaningful timings you need a very large sample. The demo has windows.inc but at 1.2 meg its too small to show a timing most of the time. I tested it out on a 320 meg sample and it ran it in about 1.5 seconds so it is probably fast enough to be useful.

It handles MASM style alternate quotes and the delimiters are determined by the characters in the table. This test is set up for normal white space only which is with a space or a tab but by tweaking the table, it can easily handle , ; : _ and any other characters you may need for this task.

I have written it with a logic that if a quote of either type is not preceded by a delimiter, the algo does not treat the character as a quotation and if you have an unmatched quote, it will treat the rest of the text as a single argument. It probably should be flagged as an error if this happens but I have not written this into the algo yet.

[attachment deleted by admin]

James Ladd · June 25, 2006, 05:11:44 AM

WOW, great piece of work.

KSS · June 25, 2006, 08:58:46 PM

Mr. Hutch,
where You use "An in place tokeniser"? In real project?
(I am not known where I may use this)

James Ladd · June 26, 2006, 01:18:54 AM

KSS,
Im probably going to use the inplace tokenizer in the interpreter I am writing.

hutch-- · June 26, 2006, 01:30:03 AM

Loading data into an array for sorting, command line parsing if you like the old style C stuff, more or less anything where you need words loaded into an array.

KSS · June 26, 2006, 06:46:39 PM

Excuse me, my error... :red
I have just finished a writing of the program thad work with large text files but my program must edit loaded in memory files. So my brain has made a mistake. :(

tenkey · June 27, 2006, 02:30:42 AM

Quote from: James Ladd on June 26, 2006, 01:18:54 AM
KSS,
Im probably going to use the inplace tokenizer in the interpreter I am writing.

For that purpose, you may need to eliminate or delay the in-place termination, or store string lengths with the pointers.

Consider what happens if you terminate all token strings before using them, in expressions like

(a+b*c)

James Ladd · June 27, 2006, 04:17:09 AM

Good point TenKey,
Maybe I can remove the termination and do pointer arithmatic for lengths, especially if the tokens are
ordered.

James Ladd · July 05, 2006, 09:41:56 PM

Hutch,

Is there any advantage in using the approach that Donkey used in his string length algo by
looking at a dword at a time rather than a byte?

I guess such an approach makes more sense for parsing things that might have lines to
ignore like a ';' in assembler.

Could you also explain why in the ltok2 example you moved away from a table and how
the new code functions and what you think are the advantages over a table based
approach ?

Rgs, James.

hutch-- · July 06, 2006, 06:40:55 AM

James,

The generic 4 byte algo that Agner Fog designed is very well suited for determining a single byte but sequential locations in a byte array with different starting characters is not what it is useful for.

The second version of the line tokeniser is a lot simpler than the word tokeniser and does not need the spread of characters as it mainly determines the CRLF to set the end of the line and you can use a single compare for 32 and under rather than test the character from a table. The first loop just scans through the source ignoring anything below ascii 33 until it finds a line start character above that value.

The character in the source must be obtained as a memory access but by not using the table to determine its type, you save an additional memory access each iteration of the loop which should make it faster.

James Ladd · July 06, 2006, 06:48:00 AM

Hutch,

Thanks.

I find whenever I run the example the reported time is always the same.
Milliseconds 24.

Could there be a bug as I tried it on a 132M file and I know the computer and your code is fast
but this seems wrong.

hutch-- · July 06, 2006, 07:03:21 AM

James,

I think the original was overwriting EAX. Substitute the part for the benchmark with this code.

Code Select


    invoke GetTickCount
    push eax

    mov acnt, rv(ltok,hMem,ADDR pArr)

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    push eax        ; preserve EAX

    print "Milliseconds duration = "
    pop eax         ; restore EAX
    print str$(eax),13,10

James Ladd · July 06, 2006, 11:02:01 PM

Hutch,

Oh my, did I find a bug in your code? :dazzled:

I'll give it a try.

On the same topic, if you wanted to parse some input and take different actions based on the
existance of characters in the input, what would be your suggested approach to this?
For example, you might want to read all input upto a certain character, or you might want to
start doing processing X when character '+' is found in the input?

Would you use your table based parsing for this?
Would that table be a table of 'jmp' addresses or pointers to functions to call.

Im interested in how to develop a parser that is as fast as possible.

Im also interested in how to ignore input in an efficient manner too.
For example, if the input has a character that indicated a comment, then the rest of the input
is ignored upto and including a new-line.
This last part is where I thought the agner/donkey approach of reading and comparing DWORD values
could help.

What do you think ?

Rgs, James.

hutch-- · July 07, 2006, 11:06:49 AM

James,

The types of parser deigns I have played with over time lean towards table based distinctions where you can if you need to make the seperations of character classes reasonably complex. Think of a table that has unacceptable characters set to zero, then with the remaining characters, you can then divide them up between acceptable variable name characters with 1 in the table, punctuation set to 2 in the table, various types of brackets set to 3 in the table and with any of the defined character classes you can individually scan for specific characters within that class to branch off to specialised procesing.

Quotation handlers are basically done this way, determine which type of quote, branch to a quote handler and not return until you have found either the closing quote or a quotation error. In most instances a quotation error would be either a CR or LF or a terminator before the closing quote. tables have a higher overhead as they perform more memory accesses than a simple line terminator or similar but they have the advantage that they have about the same overhead when testing for the whole character range where other methods get far slower as the character range to be tested gets larger.

James Ladd · July 07, 2006, 11:56:00 PM

Hutch,

So if the initial table contains a 0,1,2 or 3 then do you just cmp and jmp to handle the
processing indicated by the number? ie: its a 3 so ignore the rest of the line

Would it be faster / more efficient to just jmp to an address in the cell if size of the table wasnt
an issue?

I guess im wondering is the initial table of 0,1,2,3 etc is necessary or if there is a performance
benefit to this approach.

Hutch ?

Rgs, James.

News:

An in place tokeniser.