I have a file containing letters and numbers. I want to split it into two separate files: one that will contain the numbers and another that will contain the letters.
I can read the file byte by byte, or I can read a chunk and process it, then move to the next chunk. What's the threshold? That is, how many bytes is optimal to read at once?
I saw that some programs allocate the space for the file contents as ? (that is, as uninitialized data whose size is fixed when the program is assembled). Why not use dynamic allocation?
Doing file I/O a byte at a time will be hideously slow. At a minimum you want to read a sector (512 bytes), or perhaps a cluster. Personally, I would do it 32 KB at a time.
-Clive
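To make the chunked approach concrete, here is a minimal sketch in portable C rather than MASM. The function name `split_stream` and the `CHUNK_SIZE` constant are made up for illustration; the 32 KB figure is simply the size suggested above, not a measured optimum.

```c
#include <stdio.h>

#define CHUNK_SIZE (32 * 1024)  /* 32 KB per read, as suggested above */

/* Reads `in` one chunk at a time. Digits are copied to `digits`,
   ASCII letters to `letters`, and every other byte is dropped.
   Returns 0 on success, -1 on a read or write error. */
int split_stream(FILE *in, FILE *digits, FILE *letters)
{
    unsigned char chunk[CHUNK_SIZE];
    size_t got;

    while ((got = fread(chunk, 1, sizeof chunk, in)) > 0) {
        for (size_t i = 0; i < got; i++) {
            unsigned char c = chunk[i];
            if (c >= '0' && c <= '9') {
                if (fputc(c, digits) == EOF) return -1;
            } else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
                if (fputc(c, letters) == EOF) return -1;
            }
        }
    }
    return ferror(in) ? -1 : 0;
}
```

The three streams would normally come from `fopen` in binary mode. Changing `CHUNK_SIZE` and timing the result is the easiest way to see where the gains level off on a given machine.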
Byte by byte would be your worst solution because of the overhead on each disk access. I personally would probably read either the cache size or the page size at a time.
IMO, there is no good reason to have a static buffer allocated at compile time for things like this. I would definitely use dynamic allocation at runtime.
However, for your particular situation, an implementation of concurrency is absolutely ideal: have multiple synchronized threads read different blocks and do their own processing, then combine their results at the end.
If it's not too big (less than a gig) just load the whole thing at once, or use file mapping.
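For the "load the whole thing at once" option combined with dynamic allocation, a rough C sketch might look like the following. The helper name `load_whole_file` is hypothetical, and it assumes the stream was opened in binary mode and is small enough for `ftell` to report its size in a `long`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Loads the entire contents of `f` into a buffer allocated at runtime.
   On success returns the buffer (the caller must free it) and stores
   the byte count in *len. Returns NULL on any failure. */
unsigned char *load_whole_file(FILE *f, size_t *len)
{
    if (fseek(f, 0, SEEK_END) != 0) return NULL;
    long size = ftell(f);               /* file size in bytes */
    if (size < 0) return NULL;
    rewind(f);

    /* Ask for at least one byte so an empty file still yields a valid pointer. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) return NULL;

    if (fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        return NULL;
    }
    *len = (size_t)size;
    return buf;
}
```

Because the buffer is sized from the file itself, nothing has to be guessed when the program is built, which is the main argument against a fixed static buffer.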
The attachment is a test that uses file mapping. I didn't bother to time any other methods, but splitting a 20 MB file, with simple code that processes one byte per loop, in 1.1 seconds on a 500MHz P3, seems fairly fast.
109 ms on a q6600 in xp :bdg
I would imagine that another thread or three could cut the time.
Between 47 and 67 milliseconds for splitting Windows.inc in three parts. Looks funny afterwards :bg
include \masm32\MasmBasic\MasmBasic.inc
.data?
wrtNumbers dd 1025/4 dup(?)
wrtLetters dd 1025/4 dup(?)
wrtOthers dd 1025/4 dup(?)
maxEsi dd ?
ctNumbers dd ?
ctLetters dd ?
ctOthers dd ?
.code
start:
push Timer
Let esi=FileRead$("\Masm32\include\Windows.inc")
mov maxEsi, Len(esi)
add maxEsi, esi
Open "O", #1, "Numbers.txt"
Open "O", #2, "Letters.txt"
Open "O", #3, "Others.txt"
mov ecx, offset wrtNumbers ; load pointers
mov ebx, offset wrtLetters ; to the three
mov edi, offset wrtOthers ; buffers
.Repeat
mov al, [esi] ; get the next byte from esi
.if al>="0" && al <="9"
mov [ecx], al ; put a number into the buffer
inc ecx
cmp ecx, offset wrtNumbers+1024 ; limit reached?
jb @F ; unsigned compare, since these are addresses
mov ecx, offset wrtNumbers
Print #1:1024, ecx
.elseif (al>="A" && al<="Z") || (al>="a" && al<="z")
mov [ebx], al ; put a letter into the buffer
inc ebx
cmp ebx, offset wrtLetters+1024 ; limit reached?
jb @F ; unsigned compare, since these are addresses
mov ebx, offset wrtLetters
Print #2:1024, ebx
.else
mov [edi], al ; put something else into the buffer
inc edi
cmp edi, offset wrtOthers+1024 ; limit reached?
jb @F ; unsigned compare, since these are addresses
mov edi, offset wrtOthers
Print #3:1024, edi
.endif
@@: inc esi
.Until esi>=maxEsi
mov edx, ecx
mov ecx, offset wrtNumbers
sub edx, ecx
Print #1:edx, ecx ; write the remaining numbers
mov edx, ebx
mov ebx, offset wrtLetters
sub edx, ebx
Print #2:edx, ebx ; write the remaining letters
mov edx, edi
mov edi, offset wrtOthers
sub edx, edi
Print #3:edx, edi ; write the remaining other chars
Close
void Timer
pop edx
sub eax, edx
Print Str$("Splitting took %i ms\n", eax)
getkey
Exit
end start
93 ms jj.
Michael's code runs in 200 ms on my machine, so it's definitely more efficient. I chose a small 1024-byte buffer to demonstrate the streaming technique.
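For readers who don't use MasmBasic, the streaming idea (collect output bytes in a small buffer, write it out when it fills, write the remainder at the end) can be sketched in C roughly like this. `OutBuf`, `outbuf_put` and `outbuf_flush` are names invented for this sketch and are not part of any library.

```c
#include <stdio.h>

#define OUT_BUF_SIZE 1024   /* same small size as the buffers above */

typedef struct {
    FILE *file;                         /* destination file */
    size_t used;                        /* bytes currently buffered */
    unsigned char data[OUT_BUF_SIZE];
} OutBuf;

/* Writes the buffered bytes to the file and empties the buffer.
   Returns 0 on success, -1 on a write error. */
int outbuf_flush(OutBuf *b)
{
    size_t n = b->used;
    b->used = 0;
    return fwrite(b->data, 1, n, b->file) == n ? 0 : -1;
}

/* Appends one byte and flushes as soon as the buffer becomes full. */
int outbuf_put(OutBuf *b, unsigned char c)
{
    b->data[b->used++] = c;
    return b->used == OUT_BUF_SIZE ? outbuf_flush(b) : 0;
}
```

One `OutBuf` per output file mirrors the three buffers in the listing, and a final `outbuf_flush` on each plays the role of the "write the remaining" steps. In real C code stdio already buffers writes, so this is only meant to show the mechanism.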
First did 94ms, jj's did 16ms the first time and 15ms the second time.
GetTickCount may be inaccurate.
Quote from: BlackVortex on February 27, 2010, 08:04:50 AM
First did 94ms, jj's did 16ms the first time and 15ms the second time.
GetTickCount may be inaccurate.
It's called granularity. But the first value is higher because windows.inc was not yet in the cache.
The effective resolution of the tick count is about 10 ms (up to roughly 16 ms on some systems), so the uncertainty for two calls to GetTickCount is 20 ms or more. Synchronizing with the tick count at the start of the timed period will cut the uncertainty in half.
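The synchronization trick is independent of the language. As a hedged illustration in C: spin until the tick value changes, then start timing from that edge. `wait_for_tick_edge` and `coarse_ticks` are hypothetical names; `coarse_ticks` is only a stand-in that imitates a 10 ms tick using `clock()`, and on Windows a small wrapper around GetTickCount would take its place.

```c
#include <stdint.h>
#include <time.h>

typedef uint32_t (*tick_fn)(void);

/* Stand-in for a coarse tick counter: milliseconds derived from clock(),
   rounded down to a multiple of 10 to imitate a 10 ms resolution. */
uint32_t coarse_ticks(void)
{
    uint32_t ms = (uint32_t)((double)clock() * 1000.0 / CLOCKS_PER_SEC);
    return ms - ms % 10;
}

/* Spins until the tick counter changes and returns the new value.
   A measurement started right after this call begins on a tick
   boundary, so only the end of the timed period is uncertain. */
uint32_t wait_for_tick_edge(tick_fn get_ticks)
{
    uint32_t start = get_ticks();
    uint32_t now;
    do {
        now = get_ticks();
    } while (now == start);
    return now;
}
```

The value returned by `wait_for_tick_edge` is used as the start time, and the elapsed time is the tick count read at the end minus that value.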
This version splits windows.inc into three files in 80ms on my system:
;==============================================================================
include \masm32\include\masm32rt.inc
;==============================================================================
.data
hFile1 dd 0
hFile2 dd 0
hFile3 dd 0
hFile4 dd 0
hMap1 dd 0
hMap2 dd 0
hMap3 dd 0
hMap4 dd 0
pMem1 dd 0
pMem2 dd 0
pMem3 dd 0
pMem4 dd 0
cnt2 dd 0
cnt3 dd 0
cnt4 dd 0
fileSize dd 0
.code
;==============================================================================
start:
;==============================================================================
invoke Sleep, 3000
invoke GetTickCount
mov ebx, eax
.WHILE ebx == eax
invoke GetTickCount
.ENDW
push eax
;----------------------------------
; Open/create the necessary files.
;----------------------------------
invoke CreateFile, chr$("\masm32\include\windows.inc"),
GENERIC_READ or GENERIC_WRITE,
FILE_SHARE_READ or FILE_SHARE_WRITE,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL
mov hFile1, eax
invoke CreateFile, chr$("numbers.dat"),
GENERIC_READ or GENERIC_WRITE,
FILE_SHARE_READ or FILE_SHARE_WRITE,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
mov hFile2, eax
invoke CreateFile, chr$("letters.dat"),
GENERIC_READ or GENERIC_WRITE,
FILE_SHARE_READ or FILE_SHARE_WRITE,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
mov hFile3, eax
invoke CreateFile, chr$("others.dat"),
GENERIC_READ or GENERIC_WRITE,
FILE_SHARE_READ or FILE_SHARE_WRITE,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
mov hFile4, eax
mov fileSize, fsize(hFile1)
;--------------------------------------------------------------
; Create an unnamed file mapping object for each of the files.
; The maximum size of the first will be the size of the file.
; For the others the size of the file is zero, so an appropriate
; maximum size must be specified.
;--------------------------------------------------------------
invoke CreateFileMapping, hFile1,
NULL,
PAGE_READWRITE,
0,
0,
NULL
mov hMap1, eax
invoke CreateFileMapping, hFile2,
NULL,
PAGE_READWRITE,
0,
fileSize,
NULL
mov hMap2, eax
invoke CreateFileMapping, hFile3,
NULL,
PAGE_READWRITE,
0,
fileSize,
NULL
mov hMap3, eax
invoke CreateFileMapping, hFile4,
NULL,
PAGE_READWRITE,
0,
fileSize,
NULL
mov hMap4, eax
;---------------------------------------------------------
; Map a view of each of the files into our address space.
; The return value is the starting address of the mapped
; view.
;---------------------------------------------------------
invoke MapViewOfFile, hMap1, FILE_MAP_WRITE, 0, 0, 0
mov pMem1, eax
invoke MapViewOfFile, hMap2, FILE_MAP_WRITE, 0, 0, 0
mov pMem2, eax
invoke MapViewOfFile, hMap3, FILE_MAP_WRITE, 0, 0, 0
mov pMem3, eax
invoke MapViewOfFile, hMap4, FILE_MAP_WRITE, 0, 0, 0
mov pMem4, eax
;------------------------------------------------
; Scan the sample a byte at a time and copy each
; byte to the appropriate mapped view.
;------------------------------------------------
mov esi, pMem1
mov ebx, pMem2
mov ecx, pMem3
mov edx, pMem4
xor edi, edi
.WHILE edi < fileSize
mov al, BYTE PTR [esi+edi]
inc edi
.IF al >= '0' && al <= '9'
mov [ebx], al
inc ebx
.ELSEIF al >= 'A' && al <= 'Z' || al >= 'a' && al <= 'z'
mov [ecx], al
inc ecx
.ELSE
mov [edx], al
inc edx
.ENDIF
.ENDW
sub ebx, pMem2
mov cnt2, ebx
sub ecx, pMem3
mov cnt3, ecx
sub edx, pMem4
mov cnt4, edx
;-----------------------------------------------------------
; For SetEndOfFile to work we must first unmap the views of
; the files and close the file mapping object handles.
;-----------------------------------------------------------
invoke UnmapViewOfFile, pMem1
invoke UnmapViewOfFile, pMem2
invoke UnmapViewOfFile, pMem3
invoke UnmapViewOfFile, pMem4
invoke CloseHandle, hMap1
invoke CloseHandle, hMap2
invoke CloseHandle, hMap3
invoke CloseHandle, hMap4
;---------------------------------------------------
; Truncate the output files at the end of the data.
;---------------------------------------------------
invoke SetFilePointer, hFile2, cnt2, 0, FILE_BEGIN
invoke SetFilePointer, hFile3, cnt3, 0, FILE_BEGIN
invoke SetFilePointer, hFile4, cnt4, 0, FILE_BEGIN
invoke SetEndOfFile, hFile2
invoke SetEndOfFile, hFile3
invoke SetEndOfFile, hFile4
fclose hFile1
fclose hFile2
fclose hFile3
fclose hFile4
invoke GetTickCount
pop edx
sub eax, edx
print str$(eax)," ms",13,10,13,10
inkey "Press any key to exit..."
exit
;==============================================================================
end start
> This version splits windows.inc into three files in 80ms on my system
Michael,
I can't convince your code to run in more than 0 ms on my Celeron M...
:wink
:bg
> GetTickCount may be inaccurate.
Nothing is accurate in ring3 at this duration. GetTickCount() becomes useful over about 200 ms, but much below that the results are nonsense. You can use higher resolution methods, but in ring3 they will all be about as useless unless you set up a test piece that runs for about half a second; then you start getting down to the low percentage points.
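One way to set up such a test piece is to repeat the work until enough time has passed and then divide by the number of repetitions. The C sketch below assumes `clock()` as the timer (it reports processor time on most systems) in place of GetTickCount; `time_per_call` and `sample_work` are illustrative names only, and `sample_work` is a dummy workload rather than the file splitter.

```c
#include <time.h>

static volatile unsigned long sink;   /* keeps the dummy loop from being optimized away */

/* Dummy workload standing in for the code under test. */
void sample_work(void)
{
    for (unsigned long i = 0; i < 1000; i++)
        sink += i;
}

/* Calls work() repeatedly until at least min_seconds have elapsed
   according to clock(), then returns the average seconds per call. */
double time_per_call(void (*work)(void), double min_seconds)
{
    unsigned long calls = 0;
    clock_t start = clock();
    clock_t now;

    do {
        work();
        calls++;
        now = clock();
    } while ((double)(now - start) / CLOCKS_PER_SEC < min_seconds);

    return (double)(now - start) / CLOCKS_PER_SEC / (double)calls;
}
```

Calling `time_per_call(sample_work, 0.5)` runs the workload for about half a second. The timer is read on every pass, so for very short workloads that overhead is part of the result; batching several calls per timer read would reduce it.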