News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

New line tokeniser.

Started by hutch--, December 19, 2010, 12:34:40 AM

Previous topic - Next topic

hutch--

The library has an algo called "ltok" that in place tokenises lines of text while removing blank lines and trimming the leading white space from each line but I recently needed an algo that purely tokenises the lines without any modification so I have written a new one.

It appears to be fast enough but in using it you must be able to deal with a pointer to a NULL string which is what you end up with when the text has an empty line.

the test piece is not particularly elegant, its just there to test the algo. To see its results redirect the console output to a file.


line_tokenize yourfile.ext > result.ext


This is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    line_tokenize PROTO :DWORD,:DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL hMem  :DWORD
    LOCAL flen  :DWORD
    LOCAL hArr  :DWORD
    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov hMem, InputFile("\masm32\include\windows.inc")
    mov flen, ecx

    mov lcnt, rv(line_tokenize,hMem,ADDR hArr)  ; <<<< NOTE the ADDR of hArr

    mov esi, hArr
    mov edi, lcnt
    xor ebx, ebx

  lbl0:
    add ebx, 1

    fn StdOut,[esi]             ; display line
    fn StdOut,chr$(13,10)       ; add CRLF

    add esi, 4
    cmp ebx, edi
    jl lbl0

    free hArr
    free hMem

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

line_tokenize proc ptxt:DWORD,hArr:DWORD

  ; --------
  ; On ENTRY
  ; --------
  ; ptxt is the address of the text to tokenize.
  ; hArr is the address of a variable to receive the pointer array address

  ; -------
  ; On EXIT
  ; -------
  ; 1. pointer array has the address of each line
  ; 2. the return value in EAX is the line count

  ; -------------------------------
  ; When data is no longer required
  ; -------------------------------
  ; The passed variable "hArr" must be freed with GlobalFree()
  ; or the macro "free"

    mov ecx, [esp+4]            ; ptxt
    xor edx, edx
    sub ecx, 1

  ; --------------------------
  ; count the carriage returns
  ; --------------------------
  lp1:
    add ecx, 1
    movzx eax, BYTE PTR [ecx]
    test eax, eax
    jz lout
    cmp eax, 13
    jne lp1
    add edx, 1
    jmp lp1

  ; -----------------------------------
  ; set buffer size and allocate memory
  ; -----------------------------------
  lout:
    add edx, 1
    push edx                    ; preserve count on stack
    lea edx, [edx*4]            ; mul edx * 4
    add edx, 64
    mov edx, alloc(edx)         ; allocate the pointer array memory
    mov eax, [esp+8][4]         ; hArr - write the passed handle address to EAX
    mov [eax], edx              ; write the allocated memory handle to it
    mov ecx, [esp+4][4]         ; ptxt - text address in ECX
    mov [edx], ecx              ; 1st line address in 1st DWORD in EDX
    add edx, 4
    sub ecx, 1

  mainloop:
    add ecx, 1
  backin:
    movzx eax, BYTE PTR [ecx]   ; test each byte
    test eax, eax               ; exit on zero
    jz lpout
    cmp eax, 13                 ; test for CR
    jne mainloop                ; loop back if not

    mov BYTE PTR [ecx], 0       ; overwrite 13 with 0
    add ecx, 2                  ; set next line start
    mov [edx], ecx              ; store it in the pointer array
    add edx, 4
    jmp backin
   
  lpout:
    pop eax                     ; restore count into EAX as return value

    ret 8

line_tokenize endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

frktons

I tested it redirecting the output to a file "text.txt" and it appears to be a duplication of the
windows.inc. Is it what we are expecting, or something has changed but I did not notice?

I'm not sure I've understood how can the tokenizer be used.  ::)
Mind is like a parachute. You know what to do in order to use it :-)

hutch--

Frank,

The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

frktons

Quote from: hutch-- on December 19, 2010, 10:41:53 AM
Frank,

The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.

ok  :U Thanks Steve
Mind is like a parachute. You know what to do in order to use it :-)

Tight_Coder_Ex

ECX = Buffer size
EDI = Pointer to base

    mov al, 10 ; I used LF as EDI will point to first byte of next string
    xor edx, edx
    push ebp
    mov ebp, esp
   
    @@: repne scasb
    inc edx
    jcxz @F
    push edi
    mov word ptr [edi - 2], 0
    jmp @B   
    @@:
   
    leave


This in essence replaces most everything in line_tokenize except creating a global buffer and moving contents of stack to it.
I have a few ideas beyond this, but would need to know scope and purpose.

dedndave


Tight_Coder_Ex

Quote from: dedndave on December 27, 2010, 01:41:26 AM
jecxz @F

:P

That would be some bug to catch, a file over 65636 bytes and coincidentally have a linefeed fall on a 65K boundry.

dedndave

exactly what i was thinking - lol
so far, i THINK i have managed to catch myself from making the same mistake
jcxz is old habit from DOS days
they tell me that OR ECX,ECX/JZ is faster, anyways

hutch--

 :bg

The posted version comfortably works on a half gigabyte of data. Its hard to do much bigger as the 2 gig address space starts to get in the road.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Tight_Coder_Ex

1. What would the average filesize be?
2. Is it necessary pointers reside in memory?  As my algo could write a set of 10,000 to disk or even memory and use ReAlloc without being to weighty on stack

3. I believe there are apps in Linux called head, tail, I believe, is the purpose here to mimic them?

dedndave

if i know Hutch, he has an algo (that gets dword-aligned) to replace REPNZ SCASB
he tends to avoid SSE code for the masm32 lib - he may have something that uses MMX

hutch--

The algo is not designed around an average file size, its designed to work to the limit of memory. It creates the array of pointers as it scans the source replacing 13 with 0 and setting the beginning of the next line to the next pointer array member. The size of the algo is trivial but it is genuinely fast and that is the main criterion, tokenise a file to an array of pointers while terminating each line with zero.

I have no doubt that similar tasks can be done in many different ways but they will tend to be much slower and that would defeat the purpose of what this algo must do. An alternative would be to allocate a fixed sized memory block, reallocate it each time the pointer count gets up near its end then reallocate irt at the end to truncate any extra memory but reallocations are slow so it may not save any time over the scan that gets the CR count.

Dave, with byte aligned data of this type, it would be nearly impossible to use multi media instructions and if you did it would get badly slugged with the alignment and setup overhead.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on December 27, 2010, 04:14:07 AMIt creates the array of pointers as it scans the source replacing 13 with 0 and setting the beginning of the next line to the next pointer array member. The size of the algo is trivial but it is genuinely fast and that is the main criterion, tokenise a file to an array of pointers while terminating each line with zero.

I have no doubt that similar tasks can be done in many different ways but they will tend to be much slower ...

If you want it much faster, check Alex' algo. MB's Recall combines allocation, reading the file and tokenising it all in one, and is also really fast. They are both SSE2, of course. In the end, for most real tasks the extra speed does not matter because allocating the memory and reading in the file is a lot slower than the tokenising.

hutch--

JJ,

I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines but I don't see it or the link to your code writing an in memory array of pointers to lines in a file also in memory.

The task it performs is something like this.


  ; from this
  text db "This is a line of text",13,10
  ; to this
  text db "This is a line of text",0  ; terminates each line with zero


It stores the beginning of each line in a DWORD array and the array count is the return value. It is an in place tokeniser overwriting the original string. You have 2 addresses to deal with, the address of the original string which is the array content and an array of pointers to the lines in the string. the disk IO is simply to demonstrate that it works.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on December 27, 2010, 11:49:08 AM
I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines but I don't see it or the link to your code writing an in memory array of pointers to lines in a file also in memory.

Hutch,

Recall does exactly that: It loads a file into HeapAlloc'ed memory, then replaces ASCII 13 with nullbytes and stores the address to an array of pointers. It also stores the length of the strings in the same array, so one string takes 8 bytes, a dword for the start and another one for the length. This is to allow substituting the original string with a new (HeapAlloc'ed) string, e.g. Let My$(123)="This is a new string".

If there would be a need to do the same without loading the file, I could cook up a flag allowing that. But I can't see that need, honestly.