The library has an algorithm called "ltok" that tokenises lines of text in place while removing blank lines and trimming the leading white space from each line, but I recently needed an algorithm that purely tokenises the lines without any modification, so I have written a new one.
It appears to be fast enough, but in using it you must be able to deal with a pointer to an empty (zero-length) string, which is what you end up with when the text has an empty line.
The test piece is not particularly elegant; it's just there to test the algorithm. To see its results, redirect the console output to a file:
line_tokenize yourfile.ext > result.ext
This is the test piece.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
line_tokenize PROTO :DWORD,:DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL hMem :DWORD
LOCAL flen :DWORD
LOCAL hArr :DWORD
LOCAL lcnt :DWORD
push ebx
push esi
push edi
mov hMem, InputFile("\masm32\include\windows.inc") ; read the test file into memory
mov flen, ecx ; InputFile returns the byte count in ECX
mov lcnt, rv(line_tokenize,hMem,ADDR hArr) ; <<<< NOTE the ADDR of hArr
mov esi, hArr ; address of the pointer array
mov edi, lcnt ; line count
xor ebx, ebx ; zero the loop counter
lbl0:
add ebx, 1
fn StdOut,[esi] ; display line
fn StdOut,chr$(13,10) ; add CRLF
add esi, 4
cmp ebx, edi
jl lbl0
free hArr
free hMem
pop edi
pop esi
pop ebx
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
line_tokenize proc ptxt:DWORD,hArr:DWORD
; --------
; On ENTRY
; --------
; ptxt is the address of the text to tokenize.
; hArr is the address of a variable to receive the pointer array address
; -------
; On EXIT
; -------
; 1. pointer array has the address of each line
; 2. the return value in EAX is the line count
; -------------------------------
; When data is no longer required
; -------------------------------
; The passed variable "hArr" must be freed with GlobalFree()
; or the macro "free"
mov ecx, [esp+4] ; ptxt
xor edx, edx
sub ecx, 1
; --------------------------
; count the carriage returns
; --------------------------
lp1:
add ecx, 1
movzx eax, BYTE PTR [ecx]
test eax, eax
jz lout
cmp eax, 13
jne lp1
add edx, 1
jmp lp1
; -----------------------------------
; set buffer size and allocate memory
; -----------------------------------
lout:
add edx, 1
push edx ; preserve count on stack
lea edx, [edx*4] ; mul edx * 4
add edx, 64
mov edx, alloc(edx) ; allocate the pointer array memory
mov eax, [esp+8][4] ; hArr - write the passed handle address to EAX
mov [eax], edx ; write the allocated memory handle to it
mov ecx, [esp+4][4] ; ptxt - text address in ECX
mov [edx], ecx ; 1st line address in 1st DWORD in EDX
add edx, 4
sub ecx, 1
mainloop:
add ecx, 1
backin:
movzx eax, BYTE PTR [ecx] ; test each byte
test eax, eax ; exit on zero
jz lpout
cmp eax, 13 ; test for CR
jne mainloop ; loop back if not
mov BYTE PTR [ecx], 0 ; overwrite 13 with 0
add ecx, 2 ; skip the 10 as well, set next line start
mov [edx], ecx ; store it in the pointer array
add edx, 4
jmp backin
lpout:
pop eax ; restore count into EAX as return value
ret 8
line_tokenize endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
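For a high-level view of the same scheme, here is a minimal C sketch (hypothetical names, not part of the library): one pass to count the carriage returns, one allocation, then a second pass that overwrites each 13 with 0 and records where the following line starts.

```c
#include <stdlib.h>

/* Hypothetical C sketch of the in-place tokeniser above.
   Assumes CRLF line endings, like the assembly version. */
static size_t line_tokenize_c(char *text, char ***parr)
{
    size_t count = 1;                 /* the asm also adds 1 to the CR count */
    for (char *p = text; *p; ++p)
        if (*p == 13) ++count;

    char **arr = malloc(count * sizeof *arr);
    *parr = arr;                      /* caller frees this with free() */

    size_t i = 0;
    arr[i++] = text;                  /* 1st line address in 1st slot */
    for (char *p = text; *p && i < count; ++p) {
        if (*p == 13) {
            *p = 0;                   /* overwrite the CR with a terminator */
            arr[i++] = p + 2;         /* next line starts after the LF */
            ++p;                      /* step over the LF as well */
        }
    }
    return count;                     /* line count is the return value */
}
```

As in the assembly, an empty line in the input leaves a pointer to a zero-length string in the array.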
I tested it by redirecting the output to a file "text.txt", and the result appears to be a duplicate of windows.inc. Is that what we are expecting, or has something changed that I did not notice?
I'm not sure I've understood how the tokenizer can be used. ::)
Frank,
The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.
Quote from: hutch-- on December 19, 2010, 10:41:53 AM
Frank,
The test piece converts the source file into an array of pointers to each line. That is the tokeniser. The calling code simply puts it back together to show that it works properly.
ok :U Thanks Steve
ECX = Buffer size
EDI = Pointer to base
mov al, 10 ; I used LF as EDI will point to first byte of next string
xor edx, edx
push ebp
mov ebp, esp
@@: repne scasb
inc edx
jcxz @F
push edi
mov word ptr [edi - 2], 0
jmp @B
@@:
leave
This in essence replaces almost everything in line_tokenize except creating a global buffer and moving the contents of the stack into it.
I have a few ideas beyond this, but would need to know scope and purpose.
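The REPNE SCASB idea maps naturally onto memchr in C. A hypothetical sketch (my names, not the snippet's): find each LF, zero the CRLF pair (as the `mov word ptr [edi - 2], 0` does once EDI has passed the LF), and record the byte after it.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical C analogue of the REPNE SCASB loop above.
   Assumes CRLF endings and a caller-supplied pointer array. */
static size_t split_on_lf(char *buf, size_t len, char **out)
{
    size_t n = 0;
    char *p = buf, *end = buf + len;
    out[n++] = p;                       /* first line starts at the base */
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        p[-1] = 0;                      /* zero the CR */
        *p++ = 0;                       /* zero the LF, step past it */
        if (p >= end)
            break;                      /* trailing CRLF: no line follows */
        out[n++] = p;                   /* like the PUSH EDI in the loop */
    }
    return n;
}
```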
jecxz @F
:P
Quote from: dedndave on December 27, 2010, 01:41:26 AM
jecxz @F
:P
That would be some bug to catch: a file over 65536 bytes that coincidentally has a linefeed fall on a 64K boundary.
exactly what i was thinking - lol
so far, i THINK i have managed to catch myself from making the same mistake
jcxz is old habit from DOS days
they tell me that OR ECX,ECX/JZ is faster, anyways
:bg
The posted version comfortably works on half a gigabyte of data. It's hard to do much bigger as the 2 gig address space starts to get in the road.
1. What would the average file size be?
2. Is it necessary that the pointers reside in memory? My algo could write a set of 10,000 to disk or even memory and use ReAlloc without being too weighty on the stack.
3. I believe there are apps in Linux called head and tail; is the purpose here to mimic them?
if i know Hutch, he has an algo (that gets dword-aligned) to replace REPNZ SCASB
he tends to avoid SSE code for the masm32 lib - he may have something that uses MMX
The algo is not designed around an average file size; it's designed to work to the limit of memory. It creates the array of pointers as it scans the source, replacing 13 with 0 and setting the beginning of the next line as the next pointer array member. The size of the algo is trivial, but it is genuinely fast, and that is the main criterion: tokenise a file to an array of pointers while terminating each line with zero.
I have no doubt that similar tasks can be done in many different ways, but they will tend to be much slower, and that would defeat the purpose of what this algo must do. An alternative would be to allocate a fixed-size memory block, reallocate it each time the pointer count gets up near its end, then reallocate it at the end to truncate any extra memory, but reallocations are slow, so it may not save any time over the scan that gets the CR count.
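The reallocation alternative mentioned above would look roughly like this in C (a hypothetical sketch with my own names; whether it beats the counting pre-pass depends on the allocator):

```c
#include <stdlib.h>

/* Hypothetical grow-as-you-go alternative: no CR-counting pre-pass,
   just double the pointer array when it fills and truncate it to the
   final size at the end. Assumes CRLF line endings. */
static char **tokenize_grow(char *text, size_t *count)
{
    size_t cap = 64, n = 0;
    char **arr = malloc(cap * sizeof *arr);
    arr[n++] = text;
    for (char *p = text; *p; ++p) {
        if (*p == 13) {
            *p = 0;                            /* terminate the line */
            if (n == cap) {                    /* count near the end: regrow */
                cap *= 2;
                arr = realloc(arr, cap * sizeof *arr);
            }
            arr[n++] = p + 2;                  /* next line after the LF */
            ++p;                               /* step over the LF */
        }
    }
    arr = realloc(arr, n * sizeof *arr);       /* truncate the extra memory */
    *count = n;
    return arr;
}
```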
Dave, with byte aligned data of this type, it would be nearly impossible to use multi media instructions and if you did it would get badly slugged with the alignment and setup overhead.
Quote from: hutch-- on December 27, 2010, 04:14:07 AMIt creates the array of pointers as it scans the source replacing 13 with 0 and setting the beginning of the next line to the next pointer array member. The size of the algo is trivial but it is genuinely fast and that is the main criterion, tokenise a file to an array of pointers while terminating each line with zero.
I have no doubt that similar tasks can be done in many different ways but they will tend to be much slower ...
If you want it much faster, check Alex' algo (http://www.masm32.com/board/index.php?topic=11061.msg128952#msg128952). MB's Recall (http://www.masm32.com/board/index.php?topic=12460.msg129612#msg129612) combines allocation, reading the file and tokenising it all in one, and is also really fast. They are both SSE2, of course. In the end, for most real tasks the extra speed does not matter because allocating the memory and reading in the file is a lot slower than the tokenising.
JJ,
I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines, but I don't see it, or the link to your code, writing an in-memory array of pointers to lines in a file that is also in memory.
The task it performs is something like this.
; from this
text db "This is a line of text",13,10
; to this
text db "This is a line of text",0 ; terminates each line with zero
It stores the beginning of each line in a DWORD array, and the array count is the return value. It is an in-place tokeniser overwriting the original string. You have two addresses to deal with: the address of the original string, whose lines become the array content, and the address of the array of pointers to the lines in the string. The disk IO is simply to demonstrate that it works.
Quote from: hutch-- on December 27, 2010, 11:49:08 AM
I had a quick look at the link you mentioned and Alex's algo appears to be doing something else related to counting lines but I don't see it or the link to your code writing an in memory array of pointers to lines in a file also in memory.
Hutch,
Recall does exactly that: it loads a file into HeapAlloc'ed memory, then replaces ASCII 13 with null bytes and stores the addresses in an array of pointers. It also stores the length of the strings in the same array, so one string takes 8 bytes: a dword for the start and another for the length. This is to allow substituting the original string with a new (HeapAlloc'ed) string, e.g. Let My$(123)="This is a new string".
If there were a need to do the same without loading the file, I could cook up a flag allowing that. But I can't see that need, honestly.
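The 8-bytes-per-string layout described here corresponds to something like the following hypothetical C declaration (on 32-bit Windows a pointer is a dword, so each entry is 8 bytes; the struct name and helper are mine, not MasmBasic's):

```c
#include <string.h>

/* Hypothetical C view of the pointer-plus-length array entry.
   Keeping the length beside the pointer is what lets one entry be
   swapped for a newly allocated string later without a rescan. */
typedef struct {
    char     *start;  /* address of the nul-terminated line */
    unsigned  len;    /* its length */
} LineEntry;

/* fill one entry from a line that is already nul-terminated */
static LineEntry make_entry(char *line)
{
    LineEntry e = { line, (unsigned)strlen(line) };
    return e;
}
```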
It would be interesting to see the comparison. Here is the simple benchmark for the tokeniser.
172 ms for tokenising 2673001 lines of text
Press any key to continue ...
With the 101 meg file it runs at just under 600 meg/sec, including both the allocation and deallocation, on this Core2 quad.
To run the test, run the exe file BLDTST.EXE first to create the test piece.
It will run on a 486.
688 ms for tokenising 2673001 lines of text (Hutch)
359 ms for tokenising 2673000 lines of text (MasmBasic)
How long does the other line take JJ? :lol
:lol
i was wondering which is correct, myself
and - if the last line of the file has no CR or LF, does it still count as a line ?
next question - if you are going to count the last line that has no CR or LF, i take it it must have some other char in order to be a line
how many lines in this file ?
10,10,13,13,13,13,10,10,10,10,10,10,13,13,13
Quote from: oex on December 27, 2010, 03:32:36 PM
How long does the other line take JJ? :lol
Depends on whether you count a trailing CrLf as a "line" :wink
Quote from: dedndave on December 27, 2010, 03:48:26 PM
:lol
i was wondering which is correct, myself
and - if the last line of the file has no CR or LF, does it still count as a line ?
next question - if you are going to count the last line that has no CR or LF, i take it it must have some other char in order to be a line
how many lines in this file ?
10,10,13,13,13,13,10,10,10,10,10,10,13,13,13
Two with Recall - correctly, if end of line is defined as CrLf.
well, i can tell you that a printer would see 8 lines
for this one...
10,10,13,13,13,13,10,10,10,10,10,10,13,65,13,13
the printer would see 9 lines
it counts line feeds only
if there is a non-control character between the last line feed and EOF, add an additional line to the count
of course, form feed adds a few lines :P
you have to know how many lines per page, then count back and supplement - yikes !
Well, to be precise, Recall returns 2 lines and 7 BadLines. You need rules, and CrLf is CrLf, not LF.
Notepad sees two lines, MS Word and QEditor see 12, Notepad++ arrives at 15, WordPad 13...
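The disagreement is just a matter of definitions; two of the counting rules under discussion can be sketched as hypothetical C helpers:

```c
#include <stddef.h>

/* Contrast two line-counting rules on a raw byte buffer:
   count every LF, versus count only strict CRLF pairs. */
static int count_lf(const unsigned char *b, size_t n)
{
    int c = 0;
    for (size_t i = 0; i < n; ++i)
        if (b[i] == 10) ++c;                     /* the "printer" rule */
    return c;
}

static int count_crlf(const unsigned char *b, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 1 < n; ++i)
        if (b[i] == 13 && b[i + 1] == 10) ++c;   /* strict CrLf only */
    return c;
}
```

On the sequence 10,10,13,13,13,13,10,10,10,10,10,10,13,13,13 the LF rule gives 8, matching the printer count above, while the strict CRLF rule finds just one CrLf pair.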
These are the times from the version that JJ posted.
266 ms for tokenising 2673001 lines of text
156 ms for tokenising 2673000 lines of text
Press any key to continue ...
I am guessing it's an SSE version, but with no source supplied I cannot build it. The tokeniser at the top of this topic runs on a 486 and is fully backwards compatible with any OS version that will run 32-bit Windows.
This is as close as I can identify it from disassembly.
fn_00401D50:
push ebp
mov ebp, esp
sub esp, 34h
pushad
lea eax, [ebp-34h]
movups [eax], xmm0
movups [eax+10h], xmm1
mov eax, [ebp+8]
xor edi, edi
cmp eax, 0Ah
jbe lbl0
push eax
push edi
push eax
push 9
call fn_00402348
pop edx
push eax
sub eax, 0FFFFFFFFh
je loc_00402721
push 9
pop eax
jmp lbl1
lbl0:
push DWORD PTR [4072B6h+eax*4]
lbl1:
and [4072B6h+eax*4], edi
push edi
push DWORD PTR [esp+4]
call GetFileSize
mov [ebp-10h], eax
lea ebx, [eax+4]
shr ebx, 1
mov esi, eax
add eax, 29h
push edi
push eax
call fn_004027EF
mov [ebp-14h], eax
test al, 0Fh
jz lbl2
add eax, 8
lbl2:
xorps xmm0, xmm0
movups [eax+esi], xmm0
movups [eax+esi+10h], xmm0
mov edi, eax
mov edx, [esp]
push 0
lea eax, [ebp-8]
push eax
push esi
push edi
push edx
call ReadFile
call CloseHandle
push DWORD PTR [ebp+10h]
push ebx
push 0
call fn_00402CC7
mov [eax+4], edi
cmp edi, [ebp-14h]
jz lbl3
inc DWORD PTR [eax+4]
lbl3:
mov edx, [ebp+14h]
cmp edx, 0FFFFFF88h
jg lbl4
mov [eax+0Ch], edx
and DWORD PTR [ebp+14h], 0
lbl4:
add eax, 10h
mov [ebp-0Ch], eax
mov eax, 0A0A0A0Ah
movd xmm1, eax
pshufd xmm1, xmm1, 0
add esi, edi
mov ecx, [ebp-0Ch]
push DWORD PTR [ebp+14h]
push ebx
mov ebx, ecx
mov [ebx], edi
and DWORD PTR [405F50h], 0
xor eax, eax
lbl5:
movaps xmm0, xmm1
pcmpeqb xmm0, [edi+10h]
pmovmskb ecx, xmm0
movaps xmm0, xmm1
pcmpeqb xmm0, [edi]
pmovmskb edx, xmm0
lea edi, [edi+20h]
shl ecx, 10h
cmp edi, esi
jnb lbl6
or edx, ecx
jz lbl5
call fn_00401ED6
jmp lbl5
lbl6:
or edx, ecx
jz lbl7
call fn_00401ED6
lbl7:
cmp BYTE PTR [esi-1], 0Ah
jz lbl8
inc eax
add esi, 2
lbl8:
mov edx, [esp+4]
test edx, edx
jz lbl10
bswap edx
cmp dh, 91h
jnz lbl9
inc eax
jmp lbl10
lbl9:
bswap edx
mov [edx], eax
lbl10:
mov [ebx+eax*8], esi
pop ebx
pop edi
mov edi, [ebp-0Ch]
mov [edi-10h], eax
mov ecx, eax
lbl11:
mov edx, [edi]
sub edx, [edi+8]
add edx, 2
mov [edi+4], edx
add edi, 8
dec ecx
jg lbl11
mov [esp+1Ch], eax
xor edi, edi
push edi
push eax
push DWORD PTR [ebp+0Ch]
call fn_00402CC7
push DWORD PTR [ebp+0Ch]
push edi
call fn_00402FFF
push edi
call fn_00402EDC
lea eax, [ebp-34h]
movups xmm0, [eax]
movups xmm1, [eax+10h]
popad
mov edx, [ebp-10h]
mov esp, ebp
pop ebp
ret 10h
lbl12:
pop esi
ret
Quote from: hutch-- on December 27, 2010, 09:15:25 PM
These are the times from the version that JJ posted.
266 ms for tokenising 2673001 lines of text
156 ms for tokenising 2673000 lines of text
Press any key to continue ...
I am guessing its an SSE version but with no source supplied I cannot build it. The tokeniser at the top of this topic runs on a 486 and is fully backwards compatible with any OS version that will run 32 bit Windows.
This is as close as I can identify it from disassembly.
...
Hutch,
It used to be available as getlinesjj through the Forum's search function - there was an endless thread testing this algo and variants.
Here is the innermost loop (with some inspiration from Lingo, NightWare and Agner Fog):
L1: movaps xmm0, xmm1 ; linefeeds in xmm0 & xmm1
pcmpeqb xmm0, [edi+16] ; compare packed bytes in [m128] and xmm0 for equality
pmovmskb ecx, xmm0 ; set byte mask in ecx for second 16 byte chunk
movaps xmm0, xmm1 ; linefeeds in xmm0 (pcmpeqb resets xmm0 to zero - we must reload it!)
pcmpeqb xmm0, [edi] ; compare packed bytes in [m128] and xmm0 for equality
pmovmskb edx, xmm0 ; set byte mask in edx for first 16 byte chunk
lea edi, [edi+32] ; point to next chunk
shl ecx, 16
cmp edi, esi ; test boundary
jae LastString
or edx, ecx
jz L1
P.S.: The source is included in toktok.zip - all you need is include \masm32\MasmBasic\MasmBasic.inc (http://www.masm32.com/board/index.php?topic=12460) :wink
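The pcmpeqb/pmovmskb trick in the loop above can also be rendered with SSE2 intrinsics in C (a hypothetical sketch, not JJ's actual code; the assembly uses aligned movaps loads, while this sketch uses an unaligned load for simplicity):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Compare 16 bytes at once against LF and collapse the result to a
   bit mask, one bit per byte: bit i is set if byte i is a linefeed. */
static uint32_t lf_mask16(const char *p)
{
    __m128i lf    = _mm_set1_epi8('\n');              /* linefeeds in every lane */
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    __m128i eq    = _mm_cmpeq_epi8(chunk, lf);        /* 0xFF where bytes match */
    return (uint32_t)_mm_movemask_epi8(eq);           /* 1 bit per byte */
}
```

A zero mask means no linefeed in the chunk, which is what lets the assembly loop skip 32 bytes per iteration in the common case.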