The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: Sergiu FUNIERU on February 27, 2010, 01:25:00 AM

Title: The best way to process a file
Post by: Sergiu FUNIERU on February 27, 2010, 01:25:00 AM
I have a file with letters and numbers. I want to split it into 2 separate files: one that will contain the numbers and another that will contain the letters.

I can read the file byte by byte, or I can read a chunk and process it, then move to the next chunk. What's the threshold? That is, how many bytes is optimal to read at once?

I saw that some programs allocate the space for the file content as ? (uninitialized data). Why not use dynamic allocation?
Title: Re: The best way to process a file
Post by: clive on February 27, 2010, 01:32:05 AM
Doing file I/O a byte at a time will be hideously slow. At a minimum you want a sector (512 bytes), or perhaps a cluster. Personally, I would do it 32 KB at a time.

-Clive
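
For illustration, here is a minimal portable C sketch of the chunked approach described above, reading 32 KB per call instead of one byte. It is not code from this thread: the name split_stream and the CHUNK_SIZE constant are hypothetical, and it uses the C standard library rather than the Win32 calls a MASM program would use.

```c
#include <stdio.h>

#define CHUNK_SIZE (32 * 1024)   /* 32 KB per read, as suggested above */

/* Hypothetical helper: copies the digits found in `in` to `digits` and the
   ASCII letters to `letters`, reading CHUNK_SIZE bytes per fread call.
   Bytes that are neither are dropped. Returns 0 on success, -1 on I/O error. */
int split_stream(FILE *in, FILE *digits, FILE *letters)
{
    unsigned char buf[CHUNK_SIZE];
    size_t n;

    /* fread returns the number of bytes actually read; 0 means EOF or error */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            if (c >= '0' && c <= '9')
                fputc(c, digits);
            else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
                fputc(c, letters);
        }
    }
    return (ferror(in) || ferror(digits) || ferror(letters)) ? -1 : 0;
}
```

The caller opens the three streams (for example with fopen in binary mode) and closes them afterwards. The output side relies on stdio's own buffering, so fputc does not hit the disk on every byte.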
Title: Re: The best way to process a file
Post by: Slugsnack on February 27, 2010, 01:44:59 AM
byte by byte would be your worst solution because of the overhead on each disk access. i personally would probably read either the cache size or page size

IMO, there is no good reason to have a static buffer allocated at compile time for things like this. i would definitely use dynamic allocation at runtime

however, for your particular situation, an implementation of concurrency is absolutely ideal. that is, have multiple threads synchronized to read different blocks and do their own processing. then combine their results at the end.
Title: Re: The best way to process a file
Post by: sinsi on February 27, 2010, 02:34:36 AM
If it's not too big (less than a gig) just load the whole thing at once, or use file mapping.
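
As a rough sketch of the "load the whole thing at once" option, the following portable C function reads a file into a dynamically allocated buffer sized at run time. The name load_file is hypothetical, and it assumes the file fits in memory and that its size fits in a long. File mapping is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a malloc'ed buffer.
   On success returns the buffer (the caller must free it) and stores the
   byte count in *len. Returns NULL on any failure. */
unsigned char *load_file(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* Find the file size by seeking to the end */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    /* Allocate at least one byte so an empty file still yields a valid pointer */
    unsigned char *buf = malloc(size ? (size_t)size : 1);
    if (!buf) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(buf); return NULL; }

    *len = got;
    return buf;
}
```

Once the data is in memory, the split becomes a single pass over the buffer with no further read calls.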
Title: Re: The best way to process a file
Post by: MichaelW on February 27, 2010, 07:16:05 AM
The attachment is a test that uses file mapping. I didn't bother to time any other methods, but splitting a 20 MB file, with simple code that processes one byte per loop, in 1.1 seconds on a 500MHz P3, seems fairly fast.
Title: Re: The best way to process a file
Post by: sinsi on February 27, 2010, 07:38:15 AM
109 ms on a q6600 in xp  :bdg
I would imagine that another thread or three could cut the time.
Title: Re: The best way to process a file
Post by: jj2007 on February 27, 2010, 07:50:05 AM
Between 47 and 67 milliseconds for splitting Windows.inc in three parts. Looks funny afterwards :bg

include \masm32\MasmBasic\MasmBasic.inc

.data?
wrtNumbers dd 1025/4 dup(?)
wrtLetters dd 1025/4 dup(?)
wrtOthers dd 1025/4 dup(?)
maxEsi dd ?
ctNumbers dd ?
ctLetters dd ?
ctOthers dd ?

.code
start:
    push Timer
    Let esi=FileRead$("\Masm32\include\Windows.inc")
    mov maxEsi, Len(esi)
    add maxEsi, esi
    Open "O", #1, "Numbers.txt"
    Open "O", #2, "Letters.txt"
    Open "O", #3, "Others.txt"
    mov ecx, offset wrtNumbers ; load pointers
    mov ebx, offset wrtLetters ; to the three
    mov edi, offset wrtOthers ; buffers
    .Repeat
        mov al, [esi] ; get the next byte from esi
        .if al>="0" && al <="9"
            mov [ecx], al ; put a number into the buffer
            inc ecx
            cmp ecx, offset wrtNumbers+1024 ; limit reached?
            jl @F
            mov ecx, offset wrtNumbers
            Print #1:1024, ecx
        .elseif (al>="A" && al<="Z") || (al>="a" && al<="z")
            mov [ebx], al ; put a letter into the buffer
            inc ebx
            cmp ebx, offset wrtLetters+1024 ; limit reached?
            jl @F
            mov ebx, offset wrtLetters
            Print #2:1024, ebx
        .else
            mov [edi], al ; put something else into the buffer
            inc edi
            cmp edi, offset wrtOthers+1024 ; limit reached?
            jl @F
            mov edi, offset wrtOthers
            Print #3:1024, edi
        .endif
@@:     inc esi
    .Until esi>=maxEsi
    mov edx, ecx
    mov ecx, offset wrtNumbers
    sub edx, ecx
    Print #1:edx, ecx ; write the remaining numbers
    mov edx, ebx
    mov ebx, offset wrtLetters
    sub edx, ebx
    Print #2:edx, ebx ; write the remaining letters
    mov edx, edi
    mov edi, offset wrtOthers
    sub edx, edi
    Print #3:edx, edi ; write the remaining other chars
    Close
    void Timer
    pop edx
    sub eax, edx
    Print Str$("Splitting took %i ms\n", eax)
    getkey
    Exit

end start
Title: Re: The best way to process a file
Post by: sinsi on February 27, 2010, 07:54:26 AM
93 ms jj.
Title: Re: The best way to process a file
Post by: jj2007 on February 27, 2010, 08:01:38 AM
Michael's code runs in 200 ms on my machine, so it's definitely more efficient. I chose a small 1024-byte buffer to demonstrate the streaming technique.
Title: Re: The best way to process a file
Post by: BlackVortex on February 27, 2010, 08:04:50 AM
First did 94ms, jj's did 16ms the first time and 15ms the second time.

GetTickCount may be inaccurate.
Title: Re: The best way to process a file
Post by: jj2007 on February 27, 2010, 08:55:55 AM
Quote from: BlackVortex on February 27, 2010, 08:04:50 AM
> First did 94ms, jj's did 16ms the first time and 15ms the second time.
>
> GetTickCount may be inaccurate.

It's called granularity. But the first value is higher because windows.inc was not yet in the cache.
Title: Re: The best way to process a file
Post by: MichaelW on February 27, 2010, 09:20:32 AM
The effective resolution of the tick count is 10ms, so the uncertainty for two calls to GetTickCount is 20ms. Synchronizing with the tick count at the start of the timed period will cut the uncertainty in half.

This version splits windows.inc into three files in 80ms on my system:

;==============================================================================
    include \masm32\include\masm32rt.inc
;==============================================================================
    .data
        hFile1    dd 0
        hFile2    dd 0
        hFile3    dd 0
        hFile4    dd 0
        hMap1     dd 0
        hMap2     dd 0
        hMap3     dd 0
        hMap4     dd 0
        pMem1     dd 0
        pMem2     dd 0
        pMem3     dd 0
        pMem4     dd 0
        cnt2      dd 0
        cnt3      dd 0
        cnt4      dd 0
        fileSize  dd 0
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 3000

    invoke GetTickCount
    mov ebx, eax
    .WHILE ebx == eax
        invoke GetTickCount
    .ENDW
    push eax

    ;----------------------------------
    ; Open/create the necessary files.
    ;----------------------------------

    invoke CreateFile, chr$("\masm32\include\windows.inc"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile1, eax
    invoke CreateFile, chr$("numbers.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile2, eax
    invoke CreateFile, chr$("letters.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile3, eax
    invoke CreateFile, chr$("others.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile4, eax

    mov fileSize, fsize(hFile1)

    ;--------------------------------------------------------------
    ; Create an unnamed file mapping object for each of the files.
    ; The maximum size of the first will be the size of the file.
    ; For the others the size of the file is zero, so an appropriate
    ; maximum size must be specified.
    ;--------------------------------------------------------------

    invoke CreateFileMapping, hFile1,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              0,
                              NULL
    mov hMap1, eax
    invoke CreateFileMapping, hFile2,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap2, eax
    invoke CreateFileMapping, hFile3,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap3, eax
    invoke CreateFileMapping, hFile4,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap4, eax

    ;---------------------------------------------------------
    ; Map a view of each of the files into our address space.
    ; The return value is the starting address of the mapped
    ; view.
    ;---------------------------------------------------------

    invoke MapViewOfFile, hMap1, FILE_MAP_WRITE, 0, 0, 0
    mov pMem1, eax
    invoke MapViewOfFile, hMap2, FILE_MAP_WRITE, 0, 0, 0
    mov pMem2, eax
    invoke MapViewOfFile, hMap3, FILE_MAP_WRITE, 0, 0, 0
    mov pMem3, eax
    invoke MapViewOfFile, hMap4, FILE_MAP_WRITE, 0, 0, 0
    mov pMem4, eax

    ;------------------------------------------------
    ; Scan the sample a byte at a time and copy each
    ; byte to the appropriate mapped view.
    ;------------------------------------------------

    mov esi, pMem1
    mov ebx, pMem2
    mov ecx, pMem3
    mov edx, pMem4
    xor edi, edi
    .WHILE edi < fileSize
        mov al, BYTE PTR [esi+edi]
        inc edi
        .IF al >= '0' && al <= '9'
            mov [ebx], al
            inc ebx
        .ELSEIF al >= 'A' && al <= 'Z' || al >= 'a' && al <= 'z'
            mov [ecx], al
            inc ecx
        .ELSE
            mov [edx], al
            inc edx
        .ENDIF
    .ENDW

    sub ebx, pMem2
    mov cnt2, ebx
    sub ecx, pMem3
    mov cnt3, ecx
    sub edx, pMem4
    mov cnt4, edx

    ;-----------------------------------------------------------
    ; For SetEndOfFile to work we must first unmap the views of
    ; the files and close the file mapping object handles.
    ;-----------------------------------------------------------

    invoke UnmapViewOfFile, pMem1
    invoke UnmapViewOfFile, pMem2
    invoke UnmapViewOfFile, pMem3
    invoke UnmapViewOfFile, pMem4

    invoke CloseHandle, hMap1
    invoke CloseHandle, hMap2
    invoke CloseHandle, hMap3
    invoke CloseHandle, hMap4

    ;---------------------------------------------------
    ; Truncate the output files at the end of the data.
    ;---------------------------------------------------

    invoke SetFilePointer, hFile2, cnt2, 0, FILE_BEGIN
    invoke SetFilePointer, hFile3, cnt3, 0, FILE_BEGIN
    invoke SetFilePointer, hFile4, cnt4, 0, FILE_BEGIN
    invoke SetEndOfFile, hFile2
    invoke SetEndOfFile, hFile3
    invoke SetEndOfFile, hFile4

    fclose hFile1
    fclose hFile2
    fclose hFile3
    fclose hFile4

    invoke GetTickCount
    pop edx
    sub eax, edx

    print str$(eax)," ms",13,10,13,10

    inkey "Press any key to exit..."
    exit

;==============================================================================
end start


Title: Re: The best way to process a file
Post by: jj2007 on February 27, 2010, 09:26:18 AM
> This version splits windows.inc into three files in 80ms on my system

Michael,
I can't convince your code to run in more than 0 ms on my Celeron M...
:wink
Title: Re: The best way to process a file
Post by: hutch-- on February 27, 2010, 10:34:36 AM
 :bg

> GetTickCount may be inaccurate.

Nothing is accurate in ring3 at this duration. GetTickCount() becomes useful over about 200ms, but much below that the results are nonsense. You can use higher resolution methods, but in ring3 they will all be about as useless unless you set up a test piece that runs for about half a second, and then you start getting down to the low percentage points.
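
One way to build a test piece that runs long enough is to repeat the work until a minimum duration has passed and then divide by the number of repetitions. The C sketch below shows that idea under stated assumptions: the names time_per_call and sample_work are hypothetical, and it uses the standard clock() function (processor time) rather than GetTickCount, so it is an illustration of the technique and not the timing code used in this thread.

```c
#include <time.h>

/* Hypothetical stand-in workload: bumps a counter passed through arg.
   A real test would call the file-splitting routine here instead. */
void sample_work(void *arg)
{
    ++*(unsigned long *)arg;
}

/* Hypothetical helper: calls work(arg) repeatedly until at least min_seconds
   of processor time has elapsed, then returns the average seconds per call.
   Averaging over a long run makes the timer's granularity a small fraction
   of the measured total. */
double time_per_call(void (*work)(void *), void *arg, double min_seconds)
{
    clock_t start = clock();
    clock_t now;
    unsigned long calls = 0;

    do {
        work(arg);
        calls++;
        now = clock();
    } while ((double)(now - start) / CLOCKS_PER_SEC < min_seconds);

    return ((double)(now - start) / CLOCKS_PER_SEC) / (double)calls;
}
```

For example, time_per_call(sample_work, &counter, 0.5) keeps calling the workload for about half a second. The input file should already be in the cache before the timed run starts, otherwise the first pass skews the average.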