The best way to process a file

Started by Sergiu FUNIERU, February 27, 2010, 01:25:00 AM

Sergiu FUNIERU

I have a file containing letters and numbers. I want to split it into two separate files: one that will contain the numbers and another that will contain the letters.

I can read the file byte by byte, or I can read a chunk and process it, then move to the next chunk. What's the threshold? That is, how many bytes is optimal to read at once?

I have seen some programs reserve the space for the file contents statically, declared with ? (uninitialized data). Why not use dynamic allocation?

clive

Doing file I/O a byte at a time will be hideously slow. At a minimum you want to read a sector (512 bytes), or perhaps a cluster. Personally, I would do it 32KB at a time.

-Clive
It could be a random act of randomness. Those happen a lot as well.
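
A minimal sketch of the chunked approach suggested above, written in portable C rather than MASM so that it can be compiled anywhere. The function name split_stream and the CHUNK_SIZE constant are hypothetical, chosen for this illustration only, and the 32 KB figure is simply the value mentioned in the post, not a measured optimum.

```c
#include <stdio.h>

#define CHUNK_SIZE (32 * 1024)  /* 32 KB per read, the size suggested in the post */

/* Reads `in` one chunk at a time and routes every byte to one of three
   output streams. Returns 0 on success, -1 on a read or write error. */
int split_stream(FILE *in, FILE *digits, FILE *letters, FILE *others)
{
    static unsigned char buf[CHUNK_SIZE];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            FILE *out;

            if (c >= '0' && c <= '9')
                out = digits;
            else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
                out = letters;
            else
                out = others;

            /* stdio buffers the output, so this is not one disk write per byte */
            if (fputc(c, out) == EOF)
                return -1;
        }
    }
    return ferror(in) ? -1 : 0;
}
```

The caller opens the input with fopen(path, "rb") and the three outputs with "wb", then closes all four streams afterwards. If only letters and numbers are wanted, the third stream can simply be discarded.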

Slugsnack

byte by byte would be your worst solution because of the overhead on each disk access. i personally would probably read either the cache size or page size

IMO, there is no good reason to have a static buffer allocated at compile time for things like this. i would definitely use dynamic allocation at runtime

however, for your particular situation, an implementation of concurrency is absolutely ideal. that is, have multiple threads synchronized to read different blocks and do their own processing. then combine their results at the end.

sinsi

If it's not too big (less than a gig) just load the whole thing at once, or use file mapping.
Light travels faster than sound, that's why some people seem bright until you hear them.
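
For the "load the whole thing at once" option, a rough sketch in portable C (not MASM) could look like the following. The buffer is sized at run time from the file length, which is the dynamic allocation asked about earlier. The name load_file is hypothetical, and the use of ftell assumes the file length fits in a long, which holds for files under a gigabyte.

```c
#include <stdio.h>
#include <stdlib.h>

/* Loads an entire file into a heap buffer whose size is decided at run time.
   On success returns the buffer (the caller frees it) and stores the byte
   count in *size. Returns NULL on any error and leaves *size untouched. */
unsigned char *load_file(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    unsigned char *data = NULL;
    long len = -1;

    /* Seek to the end to learn the length, then back to the start. */
    if (fseek(f, 0, SEEK_END) == 0)
        len = ftell(f);
    if (len >= 0 && fseek(f, 0, SEEK_SET) == 0) {
        /* Allocate at least one byte so an empty file still yields a buffer. */
        data = malloc(len > 0 ? (size_t)len : 1);
        if (data != NULL && fread(data, 1, (size_t)len, f) != (size_t)len) {
            free(data);
            data = NULL;
        }
    }
    if (data != NULL)
        *size = (size_t)len;

    fclose(f);
    return data;
}
```

Once the data is in memory, the split itself is a single pass over the buffer, and each output can be written with one call per file.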

MichaelW

The attachment is a test that uses file mapping. I didn't bother to time any other methods, but splitting a 20 MB file, with simple code that processes one byte per loop, in 1.1 seconds on a 500MHz P3, seems fairly fast.
eschew obfuscation

sinsi

109 ms on a q6600 in xp  :bdg
I would imagine that another thread or three could cut the time.

jj2007

Between 47 and 67 milliseconds for splitting Windows.inc in three parts. Looks funny afterwards :bg

include \masm32\MasmBasic\MasmBasic.inc

.data?
wrtNumbers  dd 1025/4 dup(?)
wrtLetters  dd 1025/4 dup(?)
wrtOthers   dd 1025/4 dup(?)
maxEsi      dd ?
ctNumbers   dd ?
ctLetters   dd ?
ctOthers    dd ?

.code
start:
    push Timer
    Let esi=FileRead$("\Masm32\include\Windows.inc")
    mov maxEsi, Len(esi)
    add maxEsi, esi
    Open "O", #1, "Numbers.txt"
    Open "O", #2, "Letters.txt"
    Open "O", #3, "Others.txt"
    mov ecx, offset wrtNumbers      ; load pointers
    mov ebx, offset wrtLetters      ; to the three
    mov edi, offset wrtOthers       ; buffers
    .Repeat
        mov al, [esi]               ; get the next byte from esi
        .if al>="0" && al <="9"
            mov [ecx], al           ; put a number into the buffer
            inc ecx
            cmp ecx, offset wrtNumbers+1024     ; limit reached?
            jl @F
            mov ecx, offset wrtNumbers
            Print #1:1024, ecx
        .elseif (al>="A" && al<="Z") || (al>="a" && al<="z")
            mov [ebx], al           ; put a letter into the buffer
            inc ebx
            cmp ebx, offset wrtLetters+1024     ; limit reached?
            jl @F
            mov ebx, offset wrtLetters
            Print #2:1024, ebx
        .else
            mov [edi], al           ; put something else into the buffer
            inc edi
            cmp edi, offset wrtOthers+1024      ; limit reached?
            jl @F
            mov edi, offset wrtOthers
            Print #3:1024, edi
        .endif
@@:     inc esi
    .Until esi>=maxEsi
    mov edx, ecx
    mov ecx, offset wrtNumbers
    sub edx, ecx
    Print #1:edx, ecx               ; write the remaining numbers
    mov edx, ebx
    mov ebx, offset wrtLetters
    sub edx, ebx
    Print #2:edx, ebx               ; write the remaining letters
    mov edx, edi
    mov edi, offset wrtOthers
    sub edx, edi
    Print #3:edx, edi               ; write the remaining other chars
    Close
    void Timer
    pop edx
    sub eax, edx
    Print Str$("Splitting took %i ms\n", eax)
    getkey
    Exit

end start

jj2007

Michael's code runs in 200 ms on my machine, so it's definitely more efficient. I chose a small 1024-byte buffer to demonstrate the streaming technique.

BlackVortex

First did 94ms, jj's did 16ms the first time and 15ms the second time.

GetTickCount may be inaccurate.

jj2007

Quote from: BlackVortex on February 27, 2010, 08:04:50 AM
> First did 94ms, jj's did 16ms the first time and 15ms the second time.
>
> GetTickCount may be inaccurate.

It's called granularity. But the first value is higher because windows.inc was not yet in the cache.

MichaelW

The effective resolution of the tick count is 10ms, so the uncertainty for two calls to GetTickCount is 20ms. Synchronizing with the tick count at the start of the timed period will cut the uncertainty in half.
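
The synchronization step can be sketched in portable C as below. The idea, as I read it, is that spinning until the counter changes puts the start of the timed region right after a tick boundary, so only the final reading contributes its quantization error. The tick_fn type, wait_for_tick_edge and fake_tick are hypothetical names for this illustration. On Windows the tick source would be a thin wrapper around GetTickCount, while fake_tick is just a simulated coarse counter so the sketch runs anywhere.

```c
#include <stdint.h>

/* Hypothetical tick source: any function returning a coarse counter,
   for example a wrapper around GetTickCount on Windows. */
typedef uint32_t (*tick_fn)(void);

/* Spins until the tick value changes and returns the new value, so that
   the timed region begins immediately after a tick boundary. */
uint32_t wait_for_tick_edge(tick_fn now)
{
    uint32_t start = now();
    uint32_t t;

    do {
        t = now();
    } while (t == start);
    return t;
}

/* Simulated counter for demonstration only: the value advances by 10
   on every fourth call, imitating a 10 ms tick. */
uint32_t fake_calls;

uint32_t fake_tick(void)
{
    return (fake_calls++ / 4) * 10;
}
```

The value returned by wait_for_tick_edge is used as the start time. The end time is a plain read of the same tick source, and the difference is the elapsed count.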

This version splits windows.inc into three files in 80ms on my system:

;==============================================================================
    include \masm32\include\masm32rt.inc
;==============================================================================
    .data
        hFile1    dd 0
        hFile2    dd 0
        hFile3    dd 0
        hFile4    dd 0
        hMap1     dd 0
        hMap2     dd 0
        hMap3     dd 0
        hMap4     dd 0
        pMem1     dd 0
        pMem2     dd 0
        pMem3     dd 0
        pMem4     dd 0
        cnt2      dd 0
        cnt3      dd 0
        cnt4      dd 0
        fileSize  dd 0
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 3000

    invoke GetTickCount
    mov ebx, eax
    .WHILE ebx == eax
        invoke GetTickCount
    .ENDW
    push eax

    ;----------------------------------
    ; Open/create the necessary files.
    ;----------------------------------

    invoke CreateFile, chr$("\masm32\include\windows.inc"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile1, eax
    invoke CreateFile, chr$("numbers.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile2, eax
    invoke CreateFile, chr$("letters.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile3, eax
    invoke CreateFile, chr$("others.dat"),
                       GENERIC_READ or GENERIC_WRITE,
                       FILE_SHARE_READ or FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL
    mov hFile4, eax

    mov fileSize, fsize(hFile1)

    ;--------------------------------------------------------------
    ; Create an unnamed file mapping object for each of the files.
    ; The maximum size of the first will be the size of the file.
    ; For the others the size of the file is zero, so an appropriate
    ; maximum size must be specified.
    ;--------------------------------------------------------------

    invoke CreateFileMapping, hFile1,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              0,
                              NULL
    mov hMap1, eax
    invoke CreateFileMapping, hFile2,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap2, eax
    invoke CreateFileMapping, hFile3,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap3, eax
    invoke CreateFileMapping, hFile4,
                              NULL,
                              PAGE_READWRITE,
                              0,
                              fileSize,
                              NULL
    mov hMap4, eax

    ;---------------------------------------------------------
    ; Map a view of each of the files into our address space.
    ; The return value is the starting address of the mapped
    ; view.
    ;---------------------------------------------------------

    invoke MapViewOfFile, hMap1, FILE_MAP_WRITE, 0, 0, 0
    mov pMem1, eax
    invoke MapViewOfFile, hMap2, FILE_MAP_WRITE, 0, 0, 0
    mov pMem2, eax
    invoke MapViewOfFile, hMap3, FILE_MAP_WRITE, 0, 0, 0
    mov pMem3, eax
    invoke MapViewOfFile, hMap4, FILE_MAP_WRITE, 0, 0, 0
    mov pMem4, eax

    ;------------------------------------------------
    ; Scan the sample a byte at a time and copy each
    ; byte to the appropriate mapped view.
    ;------------------------------------------------

    mov esi, pMem1
    mov ebx, pMem2
    mov ecx, pMem3
    mov edx, pMem4
    xor edi, edi
    .WHILE edi < fileSize
        mov al, BYTE PTR [esi+edi]
        inc edi
        .IF al >= '0' && al <= '9'
            mov [ebx], al
            inc ebx
        .ELSEIF al >= 'A' && al <= 'Z' || al >= 'a' && al <= 'z'
            mov [ecx], al
            inc ecx
        .ELSE
            mov [edx], al
            inc edx
        .ENDIF
    .ENDW

    sub ebx, pMem2
    mov cnt2, ebx
    sub ecx, pMem3
    mov cnt3, ecx
    sub edx, pMem4
    mov cnt4, edx

    ;-----------------------------------------------------------
    ; For SetEndOfFile to work we must first unmap the views of
    ; the files and close the file mapping object handles.
    ;-----------------------------------------------------------

    invoke UnmapViewOfFile, pMem1
    invoke UnmapViewOfFile, pMem2
    invoke UnmapViewOfFile, pMem3
    invoke UnmapViewOfFile, pMem4

    invoke CloseHandle, hMap1
    invoke CloseHandle, hMap2
    invoke CloseHandle, hMap3
    invoke CloseHandle, hMap4

    ;---------------------------------------------------
    ; Truncate the output files at the end of the data.
    ;---------------------------------------------------

    invoke SetFilePointer, hFile2, cnt2, 0, FILE_BEGIN
    invoke SetFilePointer, hFile3, cnt3, 0, FILE_BEGIN
    invoke SetFilePointer, hFile4, cnt4, 0, FILE_BEGIN
    invoke SetEndOfFile, hFile2
    invoke SetEndOfFile, hFile3
    invoke SetEndOfFile, hFile4

    fclose hFile1
    fclose hFile2
    fclose hFile3
    fclose hFile4

    invoke GetTickCount
    pop edx
    sub eax, edx

    print str$(eax)," ms",13,10,13,10

    inkey "Press any key to exit..."
    exit

;==============================================================================
end start



jj2007

> This version splits windows.inc into three files in 80ms on my system

Michael,
I can't convince your code to run in more than 0 ms on my Celeron M...
:wink

hutch--

 :bg

> GetTickCount may be inaccurate.

Nothing is accurate in ring3 at this duration. GetTickCount() becomes useful over about 200ms, but much below that the results are nonsense. You can use higher-resolution methods, but in ring3 they will all be about as useless unless you set up a test piece that runs for about half a second; then you start getting down to the low percentage points.
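
One way to build such a test piece is to repeat the work until the total run is long enough and then divide by the repeat count. The following is a rough sketch in portable C using the standard clock() function. The names time_per_call and sample_work are hypothetical, and sample_work is only a stand-in workload, not the file-splitting code from this thread.

```c
#include <time.h>

/* Calls work() repeatedly until at least min_seconds of processor time
   have elapsed, then returns the average seconds per call. The number of
   calls made is reported through *calls. */
double time_per_call(void (*work)(void), double min_seconds, unsigned long *calls)
{
    unsigned long n = 0;
    clock_t start = clock();
    clock_t now;

    do {
        work();
        n++;
        now = clock();
    } while ((double)(now - start) / CLOCKS_PER_SEC < min_seconds);

    *calls = n;
    return ((double)(now - start) / CLOCKS_PER_SEC) / (double)n;
}

/* Stand-in workload for demonstration: counts the digits in a short string. */
volatile unsigned long sink;

void sample_work(void)
{
    static const char text[] = "mov eax, 12345 ; sample line";
    unsigned long digits = 0;

    for (const char *p = text; *p != '\0'; p++)
        if (*p >= '0' && *p <= '9')
            digits++;
    sink = digits;  /* volatile store keeps the loop from being optimized away */
}
```

Calling time_per_call(sample_work, 0.5, &n) runs the workload for about half a second, as suggested above. Note that the clock() call inside the loop adds its own overhead to every iteration, so for very short workloads it is better to batch many calls of work() between clock reads.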