The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: hutch-- on December 19, 2010, 12:34:40 AM

Title: New line tokeniser.
Post by: hutch-- on December 19, 2010, 12:34:40 AM
The library has an algo called "ltok" that in place tokenises lines of text while removing blank lines and trimming the leading white space from each line but I recently needed an algo that purely tokenises the lines without any modification so I have written a new one.

It appears to be fast enough but in using it you must be able to deal with a pointer to a NULL string which is what you end up with when the text has an empty line.

the test piece is not particularly elegant, its just there to test the algo. To see its results redirect the console output to a file.


line_tokenize yourfile.ext > result.ext


This is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    line_tokenize PROTO :DWORD,:DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL hMem  :DWORD
    LOCAL flen  :DWORD
    LOCAL hArr  :DWORD
    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov hMem, InputFile("\masm32\include\windows.inc")
    mov flen, ecx

    mov lcnt, rv(line_tokenize,hMem,ADDR hArr)  ; <<<< NOTE the ADDR of hArr

    mov esi, hArr
    mov edi, lcnt
    xor ebx, ebx

  lbl0:
    add ebx, 1

    fn StdOut,[esi]             ; display line
    fn StdOut,chr$(13,10)       ; add CRLF

    add esi, 4
    cmp ebx, edi
    jl lbl0

    free hArr
    free hMem

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

line_tokenize proc ptxt:DWORD,hArr:DWORD

  ; --------
  ; On ENTRY
  ; --------
  ; ptxt is the address of the text to tokenize.
  ; hArr is the address of a variable to receive the pointer array address

  ; -------
  ; On EXIT
  ; -------
  ; 1. pointer array has the address of each line
  ; 2. the return value in EAX is the line count

  ; -------------------------------
  ; When data is no longer required
  ; -------------------------------
  ; The passed variable "hArr" must be freed with GlobalFree()
  ; or the macro "free"

    mov ecx, [esp+4]            ; ptxt
    xor edx, edx
    sub ecx, 1

  ; --------------------------
  ; count the carriage returns
  ; --------------------------
  lp1:
    add ecx, 1
    movzx eax, BYTE PTR [ecx]
    test eax, eax
    jz lout
    cmp eax, 13
    jne lp1
    add edx, 1
    jmp lp1

  ; -----------------------------------
  ; set buffer size and allocate memory
  ; -----------------------------------
  lout:
    add edx, 1
    push edx                    ; preserve count on stack
    lea edx, [edx*4]            ; mul edx * 4
    add edx, 64
    mov edx, alloc(edx)         ; allocate the pointer array memory
    mov eax, [esp+8][4]         ; hArr - write the passed handle address to EAX
    mov [eax], edx              ; write the allocated memory handle to it
    mov ecx, [esp+4][4]         ; ptxt - text address in ECX
    mov [edx], ecx              ; 1st line address in 1st DWORD in EDX
    add edx, 4
    sub ecx, 1

  mainloop:
    add ecx, 1
  backin:
    movzx eax, BYTE PTR [ecx]   ; test each byte
    test eax, eax               ; exit on zero
    jz lpout
    cmp eax, 13                 ; test for CR
    jne mainloop                ; loop back if not

    mov BYTE PTR [ecx], 0       ; overwrite 13 with 0
    add ecx, 2                  ; set next line start
    mov [edx], ecx              ; store it in the pointer array
    add edx, 4
    jmp backin
   
  lpout:
    pop eax                     ; restore count into EAX as return value

    ret 8

line_tokenize endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: New line tokeniser.
Post by: frktons on December 19, 2010, 10:34:30 AM
I tested it redirecting the output to a file "text.txt" and it appears to be a duplication of the
windows.inc. Is it what we are expecting, or something has changed but I did not notice?

I'm not sure I've understood how can the tokenizer be used.  ::)
Title: Re: New line tokeniser.
Post by: hutch-- on December 19, 2010, 10:41:53 AM
Frank,

The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.
Title: Re: New line tokeniser.
Post by: frktons on December 19, 2010, 10:46:26 AM
Quote from: hutch-- on December 19, 2010, 10:41:53 AM
Frank,

The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.

ok  :U Thanks Steve
Title: Re: New line tokeniser.
Post by: Tight_Coder_Ex on December 27, 2010, 01:34:46 AM
ECX = Buffer size
EDI = Pointer to base

    mov al, 10 ; I used LF as EDI will point to first byte of next string
    xor edx, edx
    push ebp
    mov ebp, esp
   
    @@: repne scasb
    inc edx
    jcxz @F
    push edi
    mov word ptr [edi - 2], 0
    jmp @B   
    @@:
   
    leave


This in essence replaces most everything in line_tokenize except creating a global buffer and moving contents of stack to it.
I have a few ideas beyond this, but would need to know scope and purpose.
Title: Re: New line tokeniser.
Post by: dedndave on December 27, 2010, 01:41:26 AM
jecxz @F

:P
Title: Re: New line tokeniser.
Post by: Tight_Coder_Ex on December 27, 2010, 01:47:41 AM
Quote from: dedndave on December 27, 2010, 01:41:26 AM
jecxz @F

:P

That would be some bug to catch, a file over 65636 bytes and coincidentally have a linefeed fall on a 65K boundry.
Title: Re: New line tokeniser.
Post by: dedndave on December 27, 2010, 01:58:13 AM
exactly what i was thinking - lol
so far, i THINK i have managed to catch myself from making the same mistake
jcxz is old habit from DOS days
they tell me that OR ECX,ECX/JZ is faster, anyways
Title: Re: New line tokeniser.
Post by: hutch-- on December 27, 2010, 02:19:11 AM
 :bg

The posted version comfortably works on a half gigabyte of data. Its hard to do much bigger as the 2 gig address space starts to get in the road.
Title: Re: New line tokeniser.
Post by: Tight_Coder_Ex on December 27, 2010, 02:30:26 AM
1. What would the average filesize be?
2. Is it necessary pointers reside in memory?  As my algo could write a set of 10,000 to disk or even memory and use ReAlloc without being to weighty on stack

3. I believe there are apps in Linux called head, tail, I believe, is the purpose here to mimic them?
Title: Re: New line tokeniser.
Post by: dedndave on December 27, 2010, 02:37:06 AM
if i know Hutch, he has an algo (that gets dword-aligned) to replace REPNZ SCASB
he tends to avoid SSE code for the masm32 lib - he may have something that uses MMX
Title: Re: New line tokeniser.
Post by: hutch-- on December 27, 2010, 04:14:07 AM
The algo is not designed around an average file size, its designed to work to the limit of memory. It creates the array of pointers as it scans the source replacing 13 with 0 and setting the beginning of the next line to the next pointer array member. The size of the algo is trivial but it is genuinely fast and that is the main criterion, tokenise a file to an array of pointers while terminating each line with zero.

I have no doubt that similar tasks can be done in many different ways but they will tend to be much slower and that would defeat the purpose of what this algo must do. An alternative would be to allocate a fixed sized memory block, reallocate it each time the pointer count gets up near its end then reallocate irt at the end to truncate any extra memory but reallocations are slow so it may not save any time over the scan that gets the CR count.

Dave, with byte aligned data of this type, it would be nearly impossible to use multi media instructions and if you did it would get badly slugged with the alignment and setup overhead.
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 10:49:02 AM
Quote from: hutch-- on December 27, 2010, 04:14:07 AMIt creates the array of pointers as it scans the source replacing 13 with 0 and setting the beginning of the next line to the next pointer array member. The size of the algo is trivial but it is genuinely fast and that is the main criterion, tokenise a file to an array of pointers while terminating each line with zero.

I have no doubt that similar tasks can be done in many different ways but they will tend to be much slower ...

If you want it much faster, check Alex' algo (http://www.masm32.com/board/index.php?topic=11061.msg128952#msg128952). MB's Recall (http://www.masm32.com/board/index.php?topic=12460.msg129612#msg129612) combines allocation, reading the file and tokenising it all in one, and is also really fast. They are both SSE2, of course. In the end, for most real tasks the extra speed does not matter because allocating the memory and reading in the file is a lot slower than the tokenising.
Title: Re: New line tokeniser.
Post by: hutch-- on December 27, 2010, 11:49:08 AM
JJ,

I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines but I don't see it or the link to your code writing an in memory array of pointers to lines in a file also in memory.

The task it performs is something like this.


  ; from this
  text db "This is a line of text",13,10
  ; to this
  text db "This is a line of text",0  ; terminates each line with zero


It stores the beginning of each line in a DWORD array and the array count is the return value. It is an in place tokeniser overwriting the original string. You have 2 addresses to deal with, the address of the original string which is the array content and an array of pointers to the lines in the string. the disk IO is simply to demonstrate that it works.
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 12:34:24 PM
Quote from: hutch-- on December 27, 2010, 11:49:08 AM
I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines but I don't see it or the link to your code writing an in memory array of pointers to lines in a file also in memory.

Hutch,

Recall does exactly that: It loads a file into HeapAlloc'ed memory, then replaces ASCII 13 with nullbytes and stores the address to an array of pointers. It also stores the length of the strings in the same array, so one string takes 8 bytes, a dword for the start and another one for the length. This is to allow substituting the original string with a new (HeapAlloc'ed) string, e.g. Let My$(123)="This is a new string".

If there would be a need to do the same without loading the file, I could cook up a flag allowing that. But I can't see that need, honestly.
Title: Re: New line tokeniser.
Post by: hutch-- on December 27, 2010, 12:54:26 PM
Be interesting to see the comparison. here is the simple benchmark for the tokeniser.


172 ms for tokenising 2673001 lines of text
Press any key to continue ...


With the 101 meg file it runs at just under 600 meg/sec including both the allocation and deallocation on this core2 quad.

To run the test run the exe file BLDTST.EXE first to create the test piece.

It will run on a 486.
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 03:22:17 PM
688 ms for tokenising 2673001 lines of text (Hutch)
359 ms for tokenising 2673000 lines of text (MasmBasic)
Title: Re: New line tokeniser.
Post by: oex on December 27, 2010, 03:32:36 PM
How long does the other line take JJ? :lol
Title: Re: New line tokeniser.
Post by: dedndave on December 27, 2010, 03:48:26 PM
 :lol
i was wondering which is correct, myself
and - if the last line of the file has no CR or LF, does it still count as a line ?
next question - if you are going to count the last line that has no CR or LF, i take it it must have some other char in order to be a line

how many lines in this file ?

10,10,13,13,13,13,10,10,10,10,10,10,13,13,13
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 04:05:17 PM
Quote from: oex on December 27, 2010, 03:32:36 PM
How long does the other line take JJ? :lol

Depends on whether you count a trailing CrLf as a "line" :wink
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 04:47:28 PM
Quote from: dedndave on December 27, 2010, 03:48:26 PM
:lol
i was wondering which is correct, myself
and - if the last line of the file has no CR or LF, does it still count as a line ?
next question - if you are going to count the last line that has no CR or LF, i take it it must have some other char in order to be a line

how many lines in this file ?

10,10,13,13,13,13,10,10,10,10,10,10,13,13,13

Two with Recall - correctly, if end of line is defined as CrLf.
Title: Re: New line tokeniser.
Post by: dedndave on December 27, 2010, 05:40:01 PM
well, i can tell you that a printer would see 8 lines

for this one...

10,10,13,13,13,13,10,10,10,10,10,10,13,65,13,13

the printer would see 9 lines

it counts line feeds only
if there is a non-control character between the last line feed and EOF, add an additional line to the count
of course, form feed adds a few lines   :P
you have to know how many lines per page, then count back and supplement - yikes !
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 06:10:51 PM
Well, to be precise, Recall returns 2 lines and 7 BadLines. You need rules, and CrLf is CrLf, not LF.
Notepad sees two lines, MS Word and QEditor see 12, Notepad++ arrives at 15, WordPad 13...
Title: Re: New line tokeniser.
Post by: hutch-- on December 27, 2010, 09:15:25 PM
These are the times from the version that JJ posted.


266 ms for tokenising 2673001 lines of text
156 ms for tokenising 2673000 lines of text
Press any key to continue ...


I am guessing its an SSE version but with no source supplied I cannot build it. The tokeniser at the top of this topic runs on a 486 and is fully backwards compatible with any OS version that will run 32 bit Windows.

This is as close as I can identify it from disassembly.


fn_00401D50:

    push ebp
    mov ebp, esp
    sub esp, 34h
    pushad
    lea eax, [ebp-34h]
    movups [eax], xmm0
    movups [eax+10h], xmm1
    mov eax, [ebp+8]
    xor edi, edi
    cmp eax, 0Ah
    jbe lbl0
    push eax
    push edi
    push eax
    push 9
    call fn_00402348
    pop edx
    push eax
    sub eax, 0FFFFFFFFh
    je loc_00402721
    push 9
    pop eax
    jmp lbl1

  lbl0:
    push DWORD PTR [4072B6h+eax*4]

  lbl1:
    and [4072B6h+eax*4], edi
    push edi
    push DWORD PTR [esp+4]
    call GetFileSize
    mov [ebp-10h], eax
    lea ebx, [eax+4]
    shr ebx, 1
    mov esi, eax
    add eax, 29h
    push edi
    push eax
    call fn_004027EF
    mov [ebp-14h], eax
    test al, 0Fh
    jz lbl2
    add eax, 8

  lbl2:
    xorps xmm0, xmm0
    movups [eax+esi], xmm0
    movups [eax+esi+10h], xmm0
    mov edi, eax
    mov edx, [esp]
    push 0
    lea eax, [ebp-8]
    push eax
    push esi
    push edi
    push edx
    call ReadFile
    call CloseHandle
    push DWORD PTR [ebp+10h]
    push ebx
    push 0
    call fn_00402CC7
    mov [eax+4], edi
    cmp edi, [ebp-14h]
    jz lbl3
    inc DWORD PTR [eax+4]

  lbl3:
    mov edx, [ebp+14h]
    cmp edx, 0FFFFFF88h
    jg lbl4
    mov [eax+0Ch], edx
    and DWORD PTR [ebp+14h], 0

  lbl4:
    add eax, 10h
    mov [ebp-0Ch], eax
    mov eax, 0A0A0A0Ah
    movd xmm1, eax
    pshufd xmm1, xmm1, 0
    add esi, edi
    mov ecx, [ebp-0Ch]
    push DWORD PTR [ebp+14h]
    push ebx
    mov ebx, ecx
    mov [ebx], edi
    and DWORD PTR [405F50h], 0
    xor eax, eax

  lbl5:
    movaps xmm0, xmm1
    pcmpeqb xmm0, [edi+10h]
    pmovmskb ecx, xmm0
    movaps xmm0, xmm1
    pcmpeqb xmm0, [edi]
    pmovmskb edx, xmm0
    lea edi, [edi+20h]
    shl ecx, 10h
    cmp edi, esi
    jnb lbl6
    or edx, ecx
    jz lbl5
    call fn_00401ED6
    jmp lbl5

  lbl6:
    or edx, ecx
    jz lbl7
    call fn_00401ED6

  lbl7:
    cmp BYTE PTR [esi-1], 0Ah
    jz lbl8
    inc eax
    add esi, 2

  lbl8:
    mov edx, [esp+4]
    test edx, edx
    jz lbl10
    bswap edx
    cmp dh, 91h
    jnz lbl9
    inc eax
    jmp lbl10

  lbl9:
    bswap edx
    mov [edx], eax

  lbl10:
    mov [ebx+eax*8], esi
    pop ebx
    pop edi
    mov edi, [ebp-0Ch]
    mov [edi-10h], eax
    mov ecx, eax

  lbl11:
    mov edx, [edi]
    sub edx, [edi+8]
    add edx, 2
    mov [edi+4], edx
    add edi, 8
    dec ecx
    jg lbl11
    mov [esp+1Ch], eax
    xor edi, edi
    push edi
    push eax
    push DWORD PTR [ebp+0Ch]
    call fn_00402CC7
    push DWORD PTR [ebp+0Ch]
    push edi
    call fn_00402FFF
    push edi
    call fn_00402EDC
    lea eax, [ebp-34h]
    movups xmm0, [eax]
    movups xmm1, [eax+10h]
    popad
    mov edx, [ebp-10h]
    mov esp, ebp
    pop ebp
    ret 10h

  lbl12:
    pop esi
    ret
Title: Re: New line tokeniser.
Post by: jj2007 on December 27, 2010, 11:44:22 PM
Quote from: hutch-- on December 27, 2010, 09:15:25 PM
These are the times from the version that JJ posted.


266 ms for tokenising 2673001 lines of text
156 ms for tokenising 2673000 lines of text
Press any key to continue ...


I am guessing its an SSE version but with no source supplied I cannot build it. The tokeniser at the top of this topic runs on a 486 and is fully backwards compatible with any OS version that will run 32 bit Windows.

This is as close as I can identify it from disassembly.
...

Hutch,
It used to be available as getlinesjj through the Forum's search function - there was an endless thread testing this algo and variants.
Here is the innermost loop (with some inspiration from Lingo, NightWare and Agner Fog):
L1: movaps xmm0, xmm1 ; linefeeds in xmm0 & xmm1
pcmpeqb xmm0, [edi+16] ; compare packed bytes in [m128] and xmm0 for equality
pmovmskb ecx, xmm0 ; set byte mask in edx for second 16 byte chunk
movaps xmm0, xmm1 ; linefeeds in xmm0 (pcmpeqb resets xmm0 to zero - we must reload it!)
pcmpeqb xmm0, [edi] ; compare packed bytes in [m128] and xmm0 for equality
pmovmskb edx, xmm0 ; set byte mask in edx for first 16 byte chunk
lea edi, [edi+32] ; point to next chunk
shl ecx, 16
cmp edi, esi ; test boundary
jae LastString
or edx, ecx
jz L1


P.S.: The source is included in toktok.zip - all you need is include \masm32\MasmBasic\MasmBasic.inc (http://www.masm32.com/board/index.php?topic=12460) :wink