Print Page - Quicker file line read

Title: Quicker file line read
Post by: ChrisLeslie on January 24, 2008, 12:01:39 AM

Over the Christmas/New Year break I looked at some alternative assemblers to MASM. GoAsm was immediately appealing to me with my limited experience because of the lack of clutter in the necessary code. Thanks Jeremy for a great product. I don't understand why your assembler appears less popular than others.

My question is not go-specific however: I am trying to quicken a file read line procedure by reducing the number of invocations to Win APIs. What I have come up with is this:

Code Select

FileReadLine3 FRAME handle,ptrString   ; return EAX = 0 is EOF
  LOCAL dBytesRead
  push ecx
  push edx
  push edi
  mov edi,[ptrString]
  .L1:
    invoke ReadFile,[handle],edi,2,ADDR dBytesRead,0
    mov eax,[dBytesRead]
    cmp eax,2              ; check if not 2 bytes read
    jne > .exit            ; skip because EOF   
    cmp B[edi],13          ; check for end of even length lines
    je > .exit
    cmp B[edi+1],13        ; check for end of odd length lines
    je > .exit1
    add edi,2
    jmp < .L1
  .exit1:
    add edi,1
    invoke ReadFile,[handle],edi,1,ADDR dBytesRead,0   ; dummy read to advance read byte
    mov eax,[dBytesRead] 
  .exit:
    mov W[edi],0           ; stamp zero termination to the string
    pop edi
    pop edx
    pop ecx
  ret
ENDF

Does anybody have some thoughts on how to improve the performance of this?

Regards

Chris

Title: Re: Quicker file line read
Post by: donkey on January 24, 2008, 05:45:28 AM

If you want to speed things up you might consider a couple of alternatives:

1 - File mapping which allows you to address the file without API calls.
2 - Read the file in large chunks and process the chunks in memory.

File mapping is much faster than reading a file byte by byte however since it has to be marshalled, it is slower than reading the file into memory in chunks and processing it. The processing algorithm would be the same for both.

The following code segment will read file lines in memory, either allocated or mapped, and invoke a callback (pCallback) function passing pLineText (Pointer to a buffer containing the text), cbLine (Count of characters in pLineText), nLine (Line number in file) and pLine (Offset of the first character in the file) to it.

Code Select

mov esi,[pData]
mov edi,esi
mov ebx,[fSizeLo]
mov ecx,ebx
mov eax,0Ah

L1:
repne scasb
push ecx,eax

mov eax,edi
sub eax,esi
push eax
add eax,0Fh
and eax,0FFFFFFF0h
invoke HeapAlloc,[hHeap],HEAP_ZERO_MEMORY,eax
mov [pBuffer],eax
pop ecx
sub ecx,2
jns >
mov ecx,0
:
inc D[LineCount]
invoke copymemory,eax,esi,ecx

sub esi,[pData]
invoke [pCallback],[pBuffer],ecx,[LineCount],esi
or eax,eax
jz >.EARLYEXIT

invoke HeapFree,[hHeap],0,[pBuffer]
mov esi,edi

pop eax,ecx
or ecx,ecx
jnz <L1

Donkey

Title: Re: Quicker file line read
Post by: ChrisLeslie on January 24, 2008, 08:02:07 AM

Thank Donkey for your advice. Reading into large chunks of memory might be a nuisance when inevitably the end of the chunk does not coincide with the end of a line. File buffering to allocated memory and processing the map sounds more practical and should allow random access to file locations as well. I will work on this. Perhaps an "OpenBufferedFile" procedure followed by looping on a "BufferReadLine" procedure may provide a workable structure.

Regards

Chris

Title: Re: Quicker file line read
Post by: jj2007 on January 24, 2008, 09:49:49 AM

Unless you work with really large files (e.g. over 4MB), the easiest and fastest way is reading
the whole file into a buffer, and then look for CrLf's. Below is a snippet.
What I do in my own routines is to allocated a second chunk of memory that keeps addresses
of strings in memory. This allows a very fast access to each single line after the scan is complete.

Code Select


	invoke GetFileSize, hFile, ADDR dwHighSize
	.IF dwHighSize
	    invoke CloseHandle, hFile	; we won't deal with 4MB files
	    xor	eax,eax
	    ret
	.ENDIF
	mov dwFileSize, eax
	invoke ReadFile, hFile, BufStart, dwFileSize, ADDR dwBytesRead, 0
	mov	bSuccess,eax
	invoke	CloseHandle, hFile
	mov	edi,BufStart	;start of memory
	mov	ecx,FileSize	;ecx counts down to zero

; *** edi will point to a CRLF-terminated string. We scan for the CRLF.
	dec	edi	;edi=BufStart-1, for compensation of first inc

Get_ST:	mov	edx,ecx
	mov	eax,2573	;CrLF

Get_CR:	inc	edi	;position in buffer
	dec	ecx	;global byte counter
	jle	CrFound ; end of file encountered
	cmp	word ptr[edi],ax
	jne	Get_CR

CrFound:	sub	edx,ecx	;ecx file size downto 0, edx string len counter
	inc	StrCount
... etc. - move address somewhere, loop back to Get_CR

Title: Re: Quicker file line read
Post by: donkey on January 24, 2008, 03:25:49 PM

Quote from: ChrisLeslie on January 24, 2008, 08:02:07 AM
Thank Donkey for your advice. Reading into large chunks of memory might be a nuisance when inevitably the end of the chunk does not coincide with the end of a line. File buffering to allocated memory and processing the map sounds more practical and should allow random access to file locations as well. I will work on this. Perhaps an "OpenBufferedFile" procedure followed by looping on a "BufferReadLine" procedure may provide a workable structure.

Regards

Chris

Hi Chris,

The snippet I posted is from the ReadFileLines function in my Files library, it uses a memory mapped file, you can download it with GoAsm source from my website as an example if you like or use it directly.

Donkey

Title: Re: Quicker file line read
Post by: jorgon on January 25, 2008, 12:25:04 PM

Hi Chris!

How about:-

Code Select

DATA
hFile           DD 0    ;to hold file handle
hFileMap        DD 0    ;to hold file map handle
FileInMemory    DD 0    ;to hold starting address of map view
FileSize        DD 0    ;to hold size of the file
;
CODE
;
LOADFOR_FILELINEREAD:
PUSH 0,80h,3            ;3=open existing
PUSH 0,1h,80000000h     ;8=read access
PUSH ADDR BUFFER        ;filename in Unicode (use CreateFileA if filename in ANSI)
CALL CreateFileW        ;try to open file
CMP EAX,-1              ;see if INVALID_HANDLE_VALUE
JZ >L2                  ;yes, fail
MOV [hFile],EAX         ;keep file handle
PUSH 0,EAX              ;eax=file handle
CALL GetFileSize
CMP EAX,-1              ;see if failure
JZ >L1                  ;yes
MOV [FileSize],EAX      ;keep size of file
;
;*************** file map the whole file
PUSH 0,0,0,2h,0,[hFile] ;2h=PAGE_READONLY
CALL CreateFileMappingW ;filemap the file
OR EAX,EAX              ;see if failed
JZ >L1                  ;yes, failure to establish filemap
MOV [hFileMap],EAX      ;keep file map handle
PUSH 0,0,0,4h,EAX       ;eax=file map handle 4h=FILE_MAP_READ
CALL MapViewOfFile
MOV [FileInMemory],EAX  ;keep starting address of map view
;***************
;next function can analyse the contents of the file directly,
;starting at FileInMemory, for FileSize number of bytes
CALL FILELINEREAD       ;look at lines in the file directly in memory
;
PUSH [FileInMemory]
CALL UnmapViewOfFile    ;close file mapping for this file
PUSH [hFileMap]
CALL CloseHandle        ;close file map handle for this file
L1:
PUSH [hFile]
CALL CloseHandle        ;close the file
L2:
RET
;

Title: Re: Quicker file line read
Post by: jorgon on January 25, 2008, 12:43:58 PM

Here is a version which reads the whole file into memory instead of file mapping it.

Code Select

DATA
hFile           DD 0    ;to hold file handle
FileInMemory    DD 0    ;to hold starting address of file in memory
FileSize        DD 0    ;to hold the file size
TEMP_DWORD      DD 0    ;to receive the number of bytes read
;
CODE
;
LOADFOR_FILELINEREAD:
PUSH 0,80h,3            ;3=open existing
PUSH 0,1h,80000000h     ;open file with read access only
PUSH ADDR BUFFER        ;filename in Unicode (use CreateFileA if in ANSI)
CALL CreateFileW        ;try to open file
CMP EAX,-1              ;see if INVALID_HANDLE_VALUE
JZ >L2                  ;yes
MOV [hFile],EAX         ;keep file handle
PUSH 0,EAX              ;eax=file handle
CALL GetFileSize
CMP EAX,-1              ;see if failure
JZ >L1                  ;yes
MOV [FileSize],EAX      ;keep size of file
;
;**************** load the whole file into memory and read it there
PUSH 4h,3000h,EAX,0     ;reserve and commit eax bytes of read/write memory
CALL VirtualAlloc
OR EAX,EAX              ;see if failed
JZ >L1                  ;yes
MOV [FileInMemory],EAX  ;keep memory start
PUSH 0,ADDR TEMP_DWORD,[FileSize],EAX,[hFile]
CALL ReadFile           ;read FileSize bytes into eax
OR EAX,EAX              ;see if failed
JZ >L1                  ;yes
;***************
;next function can analyse the contents of the file directly,
;starting at FileInMemory, for FileSize number of bytes
CALL FILELINEREAD       ;look at lines in the file directly in memory
;
PUSH 8000h,0,[FileInMemory]
CALL VirtualFree        ;free memory used
L1:
PUSH [hFile]
CALL CloseHandle
L2:
RET

Instead of keeping the various handles in data labels, you could keep them in registers if you were careful to distinguish between the volatile and non-volatile registers and to save and restore them in FILELINEREAD.

I've used CALL instead of INVOKE in these two examples because I find it easier to add comments when using CALL.

Title: Re: Quicker file line read
Post by: ChrisLeslie on January 26, 2008, 01:30:56 AM

Here is my version so far, using a memory buffer, as a self-contained program. I have used an offset in the buffer to keep track of where reading is up to. I also zero terminated the buffer as a means to detect the end:

Code Select

DATA SECTION
  str1 db 40 dup ?      ; string to store each file line
  ptrBuff dd ?          ; pointer to buffer
  ptrBuffOffset dd ?    ; pointer to offset in the buffer
  
CODE SECTION
START:
  invoke BufferedFileOpenReadx,"junk.txt"
  cmp eax,-1
  je > exit                      ; can't open file
  mov [ptrBuff],eax
  mov [ptrBuffOffset],eax
L1:
  invoke BufferedFileReadLinex,[ptrBuffOffset],ADDR str1
  mov [ptrBuffOffset],ebx      ; update buffer offset
  cmp eax,0ah                  ; check for end of line
  je < L1                      ; jump back if end if line detected
  
  invoke GlobalFree,[ptrBuff]
exit:
  xor eax,eax
  call ExitProcess

;######################################################################
BufferedFileOpenReadx FRAME ptrFilename    
;buffer pointer goes to eax
;buffer size goes to ebx
;note - limited to ~ 4 gig file size
  LOCAL handle,dSize,dBytesRead,ptrBuffer
  push ecx
  push edx
  mov D[ptrBuffer],0
  mov D[dSize],0
  invoke CreateFileA,[ptrFilename],80000000h,1h,0,3,80h,0
  mov [handle],eax
  cmp eax,-1
  je > .exit2         ; can't open file
  invoke GetFileSize,[handle],0
  mov [dSize],eax
  add D[dSize],2     ;add extra bytes to the buffer
  invoke GlobalAlloc,40h,[dSize]
  mov [ptrBuffer],eax
  mov eax,[dSize]
  sub eax,2          ;calculate size of buffer to read
  invoke ReadFile,[handle],[ptrBuffer],eax,ADDR dBytesRead,0
  invoke CloseHandle,[handle]
  mov eax,[ptrBuffer]
  add eax,[dSize]
  mov W[eax-2],0     ;stamp two zero terminations at the end of buffer
.exit:
  mov eax,[ptrBuffer]
  mov ebx,[dSize]
.exit2:
  pop edx
  pop ecx
  ret
ENDF

;######################################################################
BufferedFileReadLinex FRAME dPosition,ptrString 
;last byte written goes to al - eax=0 is end of buffer
;new position for next line read goes to ebx
  push edx
  push esi
  push edi
  mov esi,[dPosition]
  mov edi,[ptrString]
  mov edx,-1
  xor eax,eax
.L1:
    inc edx
    mov al,B[esi+edx]
    cmp al,0
    je > .exitp
    cmp al,13
    je < .L1
    cmp al,10
    je > .exitp
    mov B[edi+edx],al    ; store character in string
    jmp < .L1
.exitp:
  mov B[edi+edx],0       ; zero terminate string
  add esi,edx
  inc esi
  mov ebx,esi            ; return new buffer position in ebx
  pop edi
  pop esi
  pop edx
  ret
ENDF

Comparing the speed between different versions of "read line" using a 10000 line text file on my old XP P2/256MB I get:
Open file then read byte-by-byte = about 771 ms
Open file then read byte in twos (like my original example) = about 401 ms
Buffered read as above = about 10 ms
Using Java BufferedReader = about 140 ms

Chris

Title: Re: Quicker file line read
Post by: Mark Jones on July 13, 2008, 04:17:06 PM

Nice work. Possibly silly question, how is buffer data handled internally by VirtualAlloc (and the other memory allocation schemas) -- is ANSI text which is read into a VirtualAlloc buffer converted to unicode internally, the same way that ANSI strings are all handled as unicode internally? (And thus slower than using native unicode functions?)

In other words, would a unicode source file be faster to load into a VirtualAlloc buffer than an ANSI file since it wouldn't need conversion?

Title: Re: Quicker file line read
Post by: ChrisLeslie on July 14, 2008, 04:37:35 AM

Mark -

Not a silly question. Unfortunately though I'll have to wait until next Christmas break before I could get a chance to experiment with it again, the way things are at present!

Chris

Title: Re: Quicker file line read
Post by: jj2007 on July 14, 2008, 08:07:54 AM

Code Select

.L1:
    inc edx
    mov al,B[esi+edx]
    cmp al,0
    je > .exitp
    cmp al,13
    je < .L1
    cmp al,10
    je > .exitp
    mov B[edi+edx],al    ; store character in string
    jmp < .L1

This is your speed critical part. From experience with InstrCI thread (http://www.masm32.com/board/index.php?topic=9370.msg68107#msg68107), it is faster to read a word from esi (byte and dword reads are slower; but dword is not an option here); I would also compare speed between

Code Select

mov ax, [esi]
add esi, 1

and

Code Select

mov ax, [esi+edx]
add ecx, 1

The MASM Forum Archive 2004 to 2012

Project Support Forums => GoAsm Assembler and Tools => Topic started by: ChrisLeslie on January 24, 2008, 12:01:39 AM