I needed a file compare procedure, so I used this as an excuse to compare the run time of a conventional version that uses buffers allocated from the heap with the run time of a version that uses file mapping. On my Windows 2000/P3 system, using the same windows.inc for both files, the version that uses file mapping runs in ~59% of the time required by the conventional version.
Based on these results, I would expect file mapping to produce similar decreases in run time for the MASM32 procedures that read and write files. Except for a potential problem with the MapViewOfFile (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/mapviewoffile.asp) function when the swap file cannot grow (unusual circumstances?), the required functions appear to be supported starting with Windows 95.
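For reference, here is a minimal sketch of the file-mapping approach, not the exact code in the attachment: each file is mapped read-only and the views are compared directly. The labels (hFile1, hMap1, pView1, pView2, and so on) are placeholders, and error checking is omitted.
; Map the first file read-only; the second file is handled the same way.
invoke CreateFile, ADDR szFile1, GENERIC_READ, FILE_SHARE_READ, NULL,\
       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
mov hFile1, eax
invoke GetFileSize, hFile1, NULL
mov lof1, eax                                  ; length used for the compare
invoke CreateFileMapping, hFile1, NULL, PAGE_READONLY, 0, 0, NULL
mov hMap1, eax
invoke MapViewOfFile, hMap1, FILE_MAP_READ, 0, 0, 0
mov pView1, eax
; ... open and map the second file the same way into pView2 ...
invoke cmpmem, pView1, pView2, lof1            ; compare the two views directly
invoke UnmapViewOfFile, pView1
invoke CloseHandle, hMap1
invoke CloseHandle, hFile1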
[attachment deleted by admin]
Interesting!
Quote from: AMD XP 2500+ / WinXP
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1045ms
filecmp_filemapping: 508ms
48.6%
Results oscillate by at most +/- 0.6%...
Quote from: AMD XP 3000+ / WinXP SP2
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 962ms
filecmp_filemapping: 222ms
23.1%
For P4 2GHz, W2K SP4
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1582ms
filecmp_filemapping: 3281ms
Press any key to exit...
Regards, P1 :8)
I wrote a simple file-joiner program recently. I was comparing its speed against a standard and widely-used utility, HJSplit, and despite it being absolutely stripped down to bare ASM wrapped around the API calls, it was depressingly slower by around 10%, whether I used synchronous or asynchronous code. I was advised to try file mapping, on the basis that the buffering into memory is all done by the kernel and should therefore be faster than standard reads/writes to a buffer, but while it made a bit of difference it still wasn't beating HJSplit on the join I was testing.
What allowed me to finally beat HJSplit by about 10% on speed was, to my surprise, to stick with standard I/O but use the FILE_FLAG_NO_BUFFERING flag. It needs a little more care dealing with file ends because of the need to issue sector-sized reads at all times, but it's by far the fastest way of doing I/O, even over file mapping, according to the testing I did with my app.
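For anyone who hasn't used the flag, here is a minimal sketch of an unbuffered read, not my joiner code. The labels (hFile, lof, pAligned, readLen, bytesPerSector) are placeholders and error checking is omitted; FILE_FLAG_NO_BUFFERING requires both the read length and the buffer address to be sector-aligned.
; Open with FILE_FLAG_NO_BUFFERING and round the read length up to a
; sector multiple; the buffer at pAligned must be sector-aligned too.
invoke CreateFile, ADDR szFile, GENERIC_READ, FILE_SHARE_READ, NULL,\
       OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL
mov hFile, eax
invoke GetFileSize, hFile, NULL
mov lof, eax                       ; real length, used later for processing
mov ecx, bytesPerSector            ; from GetDiskFreeSpace
dec ecx                            ; sector size - 1
add eax, ecx
not ecx
and eax, ecx                       ; round up to the next sector multiple
mov readLen, eax
invoke ReadFile, hFile, pAligned, readLen, ADDR bytesRead, NULL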
Ian_B
This time the attachment includes a no-buffering version. On my Windows 2000/P3-500 system the run time for the no-buffering version varies more than the run times for the other versions, with an average somewhere close to the run time for the heap buffer version.
[attachment deleted by admin]
xp sp2 1.4GHz
filecmp
Quote
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 845ms
filecmp_filemapping: 140ms
Press any key to exit...
16.6%
filecmp2
Quote
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_heapbuffers: 863ms
filecmp_filemapping: 141ms
filecmp_nobuffering: 2840ms
Press any key to exit...
16.3%
I have now added a no-buffering asynchronous version, and while it is consistently faster than the heap buffer version, it does not come close to the file mapping version.
EDIT: Replaced attachment with new version
http://www.masmforum.com/simple/index.php?topic=4768.msg35790#msg35790
[attachment deleted by admin]
Athlon 64 3000+, XP Pro SP2 (32b):
filecmp3:
Quote
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 991ms
filecmp_filemapping: 197ms
19.9%
filecmp_nobuffering: 6382ms
filecmp_nobuffering_async: 8788ms
Quote
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 909ms
filecmp_filemapping: 141ms
filecmp_nobuffering: 2844ms
filecmp_nobuffering_async: 2888ms
Press any key to exit...
15.5%
Quote from: Athlon 64 3000+ / WinXP SP2
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 967ms
filecmp_filemapping: 225ms
23.3%
filecmp_nobuffering: 2133ms
filecmp_nobuffering_async: 2093ms
Same CPU and OS as Ghirai, but very different results for filecmp_nobuffering versions?
Athlon 1190 Mhz, Windows XP SP2
Quote
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 2307ms
filecmp_filemapping: 487ms
21.1%
filecmp_nobuffering: 4310ms
filecmp_nobuffering_async: 4125ms
Athlon 64 Mobile 3400+, Windows XP SP2:
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 804ms
filecmp_filemapping: 176ms
21.9%
filecmp_nobuffering: 2780ms
filecmp_nobuffering_async: 2514ms
Ossa
Quote from: Phoenix on May 13, 2006, 01:38:10 PM
Same CPU and OS as Ghirai, but very different results for filecmp_nobuffering versions?
I suspect the no-buffering versions are being strongly affected by hard disk performance, and the buffered versions strongly affected by CPU/memory performance. My system has a relatively fast hard disk (a Western Digital JB model with an 8MB cache) and relatively slow CPU and memory (a 500MHz P3 and single-channel 133MHz SDR SDRAM), so the heap buffer and no-buffering versions have similar run times.
filecmp_heapbuffers: 5554ms
filecmp_filemapping: 3290ms
59.2%
filecmp_nobuffering: 5326ms
filecmp_nobuffering_async: 5231ms
Compared to my system, these newer systems have much higher CPU/memory performance, but may or may not have higher hard disk performance. Perhaps Ghirai's system has a relatively slow hard disk.
Testing on my system I was somewhat disappointed that the file mapping version was less than twice as fast as the conventional version. I was expecting 3-4 times faster, as it is for the newer/faster systems.
Michael
I'd note a few things briefly from looking at your code. First, the no-buffering version doesn't actually compare ALL of the files; you truncate rather than over-running the end to the next sector boundary, so it's not actually doing the same thing as the other two procs. Second, you've hard-coded the sector multiple as 512, which may or may not be an appropriate fudge. The correct way to find it, as I'm sure you know, is on an individual-disk basis using GetDiskFreeSpace (it's the second returned parameter, Bytes Per Sector).
The third thing is that you are also comparing, rightly or wrongly, the varying speeds of allocating the memory, which has been thrashed over before. By using VirtualAlloc you are always going to get a slower result than using the heap functions, surely? It's hardly a fair comparison, therefore. I realise this was part of the point of your test, though: to find the speed of the routine when using a heap memory allocation.
For what it's worth, in my app I isolated the memory allocation from the I/O. Because I am writing apps that assume they will do a lot of I/O, I don't cripple them by allocating buffer memory on an as-needed basis every time. I make a 2MB fixed buffer (generally used as a 1MB read buffer and a 1MB write buffer) with VirtualAlloc on startup and reuse that. It's guaranteed sector-multiple-sized and only needs creating once, and I always know where it is. :P That is probably why my app sped up the no-buffering I/O compared to the file-mapping functions, which must surely do their own allocation "under the hood".
I should, therefore, probably qualify my previous comment by saying that my experience shows that if you already have an appropriately aligned buffer you can use, the no-buffering I/O should be faster from an end-user POV. Making a single buffer once on startup for reuse is a better real-world optimisation for the entire app, looking at it from a wider perspective than this macro-oriented local usage. The extra time taken for memory allocation is also subsumed into startup time, where the end user won't "notice" it as much as when it's part of a work procedure.
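A minimal sketch of that kind of one-time startup allocation (the size and the names pIoBuffer/IOBUF_SIZE are just illustrative):
; Allocate a fixed 2MB I/O buffer once at startup. VirtualAlloc returns
; page-aligned memory, so the buffer is automatically a sector multiple
; and needs no further alignment.
IOBUF_SIZE equ 200000h                         ; 2MB: 1MB read + 1MB write
invoke VirtualAlloc, NULL, IOBUF_SIZE, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
mov pIoBuffer, eax                             ; read half starts here
; write half starts at pIoBuffer + 100000h
; ...
invoke VirtualFree, pIoBuffer, 0, MEM_RELEASE  ; at shutdown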
Ian_B
Thanks Ian, I was hoping you would take a look at the code and comment on it. The truncation, the choice of VirtualAlloc, and the fixed sector size were all just sloppy shortcuts to save time. My focus was on reasonably verifying that file mapping is the best method for this application. I was assuming that the normal sector size is still 512 bytes; is this no longer so?
I modified the filecmp_nobuffering and filecmp_nobuffering_async procedures to allocate from the heap, and to calculate and use sector-aligned start addresses for the buffers. As a matter of convenience I am still truncating the files to a multiple of the sector size, but given the size of the test file this should not have any significant effect. On my system filecmp_nobuffering is a little faster and filecmp_nobuffering_async is much slower. I have replaced the last attachment with the new version.
OK, this is bugging me. I've rewritten the non-async no-buffering code to strip out all the slow shifts (the sector alignment can be done with ANDs, using a MOD value made from the sector size - 1) and I still can't get the elapsed time down from nearly 6000ms, compared to the simple heap code at 1400ms. That's even with the sector-size check done outside the loop. I can't believe those few extra instructions are causing such an enormous slowdown; apart from those it's doing exactly the same as the simple code: allocating memory x 2, reading 2 files into memory and doing exactly the same compare. :eek
However... logic is now returning, and it is telling me that the reason may be the simple fact of reading using FILE_FLAG_NO_BUFFERING. If you do 100 reads in a loop WITHOUT using that flag, the kernel will cache the file, and 99 of those times it will not read from disk but will fetch from memory. And when you test the file-mapping code straight after, it doesn't even have to read the file from disk once! :bdg Whereas the no-buffered version almost certainly has no such luck. Doing the test with multiple repeat reads of the same file is not going to give a clear answer as to which may be faster in general usage, where you tend to read a file only once. :wink
Maybe for a better test, copy the file 40 times and rename the copies with a split number. (This could be automated in the app set-up; just make sure the files are written no-buffered!! Don't forget to delete them afterwards...) Then either put all these names in an array and work through them using a pointer that updates by the length of the name, or use one name and update the split number in it between tests. That way you can do 10 tests per proc (comparing against the original file) and every test gets a different but identical file that won't be cached. Doing it this way also means you can simplify the code to read in only the test file; you may as well keep the original in memory throughout the tests.
For interest, here's my version of the sector-alignment code; it works identically for rounding up the memory allocation start point. Note that there was an error in your code: you released the wrong memory pointers in the no-buffering section (the rounded-up values instead of the memBase values), and in the two no-buffering timer loops there's a superfluous PUSH EAX.
invoke GetDiskFreeSpace,NULL, ; inside or outside loop
ADDR junkreturn,
ADDR bytesPerSector,
ADDR junkreturn,
ADDR junkreturn
...
mov ebx, bytesPerSector
sub ebx, 1 ; make MOD value from power of 2 returned
; Get file size and extend to multiple of sector size
invoke GetFileSize,hFile1,NULL
mov ecx, ebx
mov lof1, eax ; save real length of file, use for compare function
test eax, ebx ; is there a remainder
jz @F
not ecx
add eax, ebx ; add sector-length - 1
and eax, ecx ; round up to next sector-length
@@:
mov secbytes1, eax ; save correct sector-multiple read length
; you need this value as an extra LOCAL, must be same for second file
invoke GetFileSize,hFile2,NULL
cmp lof1, eax
jne quit_nobuffer
; but you can save yourself lof2 as this must be identical to lof1
; Allocate memory and calc sector-aligned start address
invoke GetProcessHeap
mov ecx, secbytes1
xor edx, edx
add ecx, ebx ; max needed memsize is rounded length + (sector-1)
push ecx ; set params now for next HeapAlloc call
push edx
push eax
invoke HeapAlloc,eax,edx,ecx
mov memBase1, eax
mov ecx, ebx
test eax, ebx ; is mem allocation sector-aligned
jz @F
not ecx
add eax, ebx ; add sector-length - 1
and eax, ecx ; round up to next sector-multiple start
@@:
mov lpMem1, eax
call HeapAlloc ; params already on stack
mov memBase2, eax
mov ecx, ebx
test eax, ebx ; is mem allocation sector-aligned
jz @F
not ecx
add eax, ebx ; add sector-length - 1
and eax, ecx ; round up to next sector-multiple start
@@:
mov lpMem2, eax
invoke ReadFile,hFile1,lpMem1,secbytes1,ADDR bytesRead,NULL
invoke ReadFile,hFile2,lpMem2,secbytes1,ADDR bytesRead,NULL
invoke cmpmem,lpMem1,lpMem2,lof1 ; whole file compared!
...
ret
quit_nobuffer:
xor ebx, ebx
jmp @B
Ian_B
Thanks for the improved code, and for finding my mistake of releasing the wrong memory pointers. Now that I look at it, using the rounded-up value for the buffers and the reads, and the actual length for the compare, should have been a no-brainer. I made no real effort to use efficient code because I couldn't see how a few hundred clock cycles could make a significant difference with a 1,127,716-byte test file. You have a good point about the caching effects, but I coded this for an application where the files normally would be cached. I think the most common use for a file comparison would be after performing some operation on the files being compared, a COPY /V for example.
Just started reading this thread. This is what I get with filecmp4 on a P3 846, 128MB RAM, XP no service pack, but stripped down with nLite. nLite should not be an issue, I don't think. Is this strange or what, 3ms each...
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 3ms
filecmp_filemapping: 3ms
100.9%
filecmp_nobuffering: 3ms
filecmp_nobuffering_async: 3ms
The 100.9% should be 100.0% .... typing error.
This is what I get with filecmp4 on my Intel Celeron 498 MHz, 384MB RAM, XP no service pack, but stripped down with nLite just like on the other machine.
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x -1
filecmp_nobuffering_async: x x x x x x -1
filecmp_heapbuffers: 9ms
filecmp_filemapping: 10ms
111.1%
filecmp_nobuffering: 10ms
filecmp_nobuffering_async: 10ms
P4 2GHz W2KSP4
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1
filecmp_heapbuffers: 1639ms
filecmp_filemapping: 3226ms
filecmp_nobuffering: 6701ms
filecmp_nobuffering_async: 11401ms
Press any key to exit...
Regards, P1 :8)
Quote from: Ian_B on May 15, 2006, 12:30:43 AM
Maybe for a better test copy the file 40 times and rename the copies with a split number.
Consider allocating contiguous blocks to eliminate file fragmentation latency.
I wrote a version of Michael's test routine that worked like I suggested, saving 80 copies of windows.inc (unbuffered) and reading each one in as a separate file comparison. It's the only way to avoid caching effects and get a realistic estimate of the different speeds of handling file input. As I suspected, my previous experience was vindicated and FILE_FLAG_NO_BUFFERING is indeed the fastest way of reading files (and writing them, although there's a nasty wrinkle at the end: because you have been forced to over-write to a sector boundary, you have to close and reopen the file without the flag and set the file pointer so you can truncate the final write back to the real EOF). Having measured the substantial difference over reading/writing almost 900 1MB files while I was testing my joiner app, I was pretty sure this more modest test would support me. :toothy The big surprise is the poor performance of the file-mapping code.
This code strips out everything unnecessary and keeps a single copy of the original file through the test. It is therefore getting an accurate estimate, as Michael originally wanted, of the "cost" of setting up a memory buffer if needed and doing the read into the buffer, and only that. Note that the asynchronous code performs pretty well, although this is a really trivial example of its use, reading in an entire file at once! To make the extra overhead of handling events and overlapped structures worthwhile, it's generally used for dealing with chunks of a file, so that you can process one data chunk while the next few chunks are being read in; in the ideal case the processing thread(s) never have to wait for further reads to complete to get new data to process, or for any writes of the processed data to complete before starting a new chunk.
Michael, I understand your view of the caching working in your favour depending on what you want to do subsequently with the file, but the point of your original test was to compare read times, and non-buffering is provably the fastest way of doing that for the simple examples in the library, or for most other applications where the important thing is simply to get the data into memory for some sort of processing.
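For completeness, here is a minimal sketch of that end-of-write wrinkle when using FILE_FLAG_NO_BUFFERING, not the code from the attachment; the names hOutFile, szOutFile and realSize are illustrative and error checking is omitted.
; The last unbuffered WriteFile had to be padded out to a sector boundary,
; so close the NO_BUFFERING handle, reopen without the flag and trim the
; file back to its real length.
invoke CloseHandle, hOutFile
invoke CreateFile, ADDR szOutFile, GENERIC_WRITE, 0, NULL,\
       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
mov hOutFile, eax
invoke SetFilePointer, hOutFile, realSize, NULL, FILE_BEGIN
invoke SetEndOfFile, hOutFile                  ; truncate at the real EOF
invoke CloseHandle, hOutFile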
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 554ms
filecmp_filemapping: 687ms - 124.0%
filecmp_nobuffering: 528ms - 95.3%
filecmp_nobuffering_async: 537ms - 96.9%
Press ENTER to exit...
Make sure you alter the drive location of the test file before testing. There may be a few easily-fixed wrinkles in the assembly due to small personalised variations in my MASM setup.
Ian_B
EDIT! Minor bugfix, doesn't affect functionality/timing, just cleanup for unmatched stack pushes on error.
[attachment deleted by admin]
Quote
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 1084ms
filecmp_filemapping: 1027ms - 94.7%
filecmp_nobuffering: 1128ms - 104.1%
filecmp_nobuffering_async: 993ms - 91.6%
Press ENTER to exit...
;;;
After the minor bugfix
;;;
Quote
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 1252ms
filecmp_filemapping: 1015ms - 81.1%
filecmp_nobuffering: 1004ms - 80.2%
filecmp_nobuffering_async: 1453ms - 116.1%
Press ENTER to exit...
Guess the variation might be platform-dependent - my system is a P4 Northwood; what's yours, six_L?
XP Home SP2
Intel(R) Pentium(R) 1.4GHz
Ian,
You have convinced me. At least on most recent systems, non-buffering is the fastest method of moving disk data that is not in the cache into a memory buffer.
Thinking about your suggestion of using multiple test files, it occurred to me that rotating such a small number of copies would be unlikely to eliminate all caching effects, because the system cache is typically large (greater than 300MB on my 512MB system, for example).
For the asynchronous procedure and large files, might there be some benefit to breaking up the reads into smaller chunks, comparing each pair of chunks while the next pair was being read?
Results on my P3-500, Windows 2000 SP4 system:
filecmp_heapbuffers: 1016ms
filecmp_filemapping: 821ms - 80.8%
filecmp_nobuffering: 728ms - 71.7%
filecmp_nobuffering_async: 724ms - 71.3%
Quote from: MichaelW on May 16, 2006, 06:58:25 AM
You have convinced me. At least on most recent systems non-buffering is the fastest method of moving disk data that is not in the cache, into a memory buffer.
I wouldn't like to say it's cut and dried, especially from the small set of results posted here so far. It definitely seems to be so on my puter; that's all I'm sure of now. Having said that, running the test repeatedly shows a wide variation of results: although the initial heap buffering is remarkably consistent in absolute timing for me, with the file mapping always slower, the no-buffering ranges from 95% to 100% and the async code has reached 125% in one run. This really isn't a final "proof" of the relative merits of these different approaches, but it's a good indicator and a start if someone wants to make further experiments.
Quote
Thinking about your suggestion of using multiple test files, it occurred to me that rotating such a small number of copies would be unlikely to eliminate all caching effects, because the system cache is typically large (greater than 300MB on my 512MB system, for example).
True. But the odd thing about your original test was that the non-buffering flag even seemed to be forcing the read to ignore the cached data where it existed, which was why your non-buffering examples were comparing so appallingly. So I'm reasonably sure that by writing the test files non-buffered we should eliminate all pre-caching and keep the test fair, and the results being so much closer together in this test tend to support that theory.
Quote
For the asynchronous procedure and large files, might there be some benefit to breaking up the reads into smaller chunks, comparing each pair of chunks while the next pair was being read?
Exactly how it should be done to make the most of asynchronous I/O. When I was first introduced to it, my ASM mentor described his own experiments in testing speed against other apps, and he was sure from his testing that the optimum data chunk size for async reading was 64KB, as this was an internal buffer size for Windows. I've used that size in my own code successfully (my 2MB read/write buffer has an array of 32 extended overlaps to control it), but I can't definitely confirm whether it is still the best. Obviously, setting up that amount of overhead is something best done once at startup rather than on an ad-hoc basis every time you need to read a file. To help the management code, so I know what should have been read and whether I have a result pending in the overlap, I extend my overlaps like this (having 8 DWORDs also helps wrap around from the end of the overlap array to the beginning, by using a MOD value):
OVERLAPPED_PLUS STRUCT
OVERLAPPED <> ; as normal OVERLAPPED plus extra 3 DWORDs
; OVERLAPPED STRUCT
; Internal DWORD ?
; InternalHigh DWORD ?
; OffsetLow DWORD ? ; this is better naming!
; OffsetHigh DWORD ?
; hEvent DWORD ?
; OVERLAPPED ENDS
Transferred DWORD ? ; this is the output variable for the read/write call
Requested DWORD ? ; this should match after a read/write call
; top bits used as flags to show last block in file
; and overlap operation pending for this block
BufferAddress DWORD ? ; pointer to the 64K buffer this overlap controls
OVERLAPPED_PLUS ENDS
Without going into multi-threading, the point of asynchronous I/O is to eliminate all the waiting for reads/writes to take place. In this case, by the time you have finished comparing the first chunk, the rest of the data should be stacking up nicely in the later sub-buffers ready to test. This is a perfect use of async code, where the processing of one chunk is not dependent on other parts of the file. Zipping, encoding/decoding, CRCing, doing MD5s, etc. are all prime candidates for this approach, where there is a lot of processing that can be done while you'd otherwise be sitting waiting for the reads to supply new data.
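To make the idea concrete, here is a minimal double-buffered sketch, not my joiner code. It assumes hFile was opened with FILE_FLAG_OVERLAPPED, buf1/buf2 hold the addresses of two 64KB sector-aligned buffers, ov1/ov2 are plain OVERLAPPED structures with manual-reset events in hEvent, and the MASM32 windows.inc spelling loffset for the low file-offset member; adjust the names to suit.
CHUNK equ 10000h                      ; 64KB chunk, as discussed above

; Queue the first two chunks, then always keep the next chunk in flight
; while the current one is being processed.
    mov ov1.loffset, 0
    invoke ReadFile, hFile, buf1, CHUNK, NULL, ADDR ov1
    mov ov2.loffset, CHUNK
    invoke ReadFile, hFile, buf2, CHUNK, NULL, ADDR ov2
    ; both calls normally return FALSE with ERROR_IO_PENDING; now wait for
    ; chunk 0 only - chunk 1 keeps transferring in the background
    invoke GetOverlappedResult, hFile, ADDR ov1, ADDR bytesDone, TRUE
    ; ... process bytesDone bytes at buf1, advance ov1.loffset by 2*CHUNK,
    ; queue the next read into buf1, then repeat with ov1/buf1 and ov2/buf2
    ; swapped until the whole file has been handled ...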
Ian_B
Quote from: AMD XP 2500+ / XP Pro SP2 / nVIDIA Chipset
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 743ms
filecmp_filemapping: 1133ms - 152.5%
filecmp_nobuffering: 514ms - 69.2%
filecmp_nobuffering_async: 457ms - 61.5%
P4 2GHz, W2K-SP4
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 897ms
filecmp_filemapping: 870ms - 97.0%
filecmp_nobuffering: 875ms - 97.5%
filecmp_nobuffering_async: 876ms - 97.7%
Press ENTER to exit...
Regards, P1 :8)
Turion 64 / XP Home (32bit) SP2
Reading test file...
Creating comparison files...
filecmp_heapbuffers: 1364ms
filecmp_filemapping: 841ms - 61.7%
filecmp_nobuffering: 840ms - 61.6%
filecmp_nobuffering_async: 834ms - 61.1%
Press ENTER to exit...
You forgot something important.
First, avoid fragmentation: use defragmented disks, and use SetFilePointer to allocate the entire output size before starting the WriteFile calls.
Also, are you only comparing absolute time? You should also check CPU usage - file mapping might surprise you!
When testing a buffered approach (heap buffer, file mapping), make sure to test a new file or flush the filesystem cache. Note that file mapping has no chance of being uncached; it will always cache.
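Something like this minimal sketch (illustrative names; it assumes the final output size finalSize is known in advance):
; Reserve the full output size before the first WriteFile so the filesystem
; gets a chance to allocate the space in one piece.
invoke SetFilePointer, hOutFile, finalSize, NULL, FILE_BEGIN
invoke SetEndOfFile, hOutFile                         ; commit the new length
invoke SetFilePointer, hOutFile, 0, NULL, FILE_BEGIN  ; back to the start
; ... WriteFile calls as normal ...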
Quote from: VlasRad on May 17, 2006, 07:51:42 PM
You forgot something important.
First, avoid fragmentation: use defragmented disks, and use SetFilePointer to allocate the entire output size before starting the WriteFile calls.
VlasRad, there are many variables. The point of Michael's original test and my follow-up was to compare the different times of loading a file that a small, normal program might use. Most normal programs have to cope with fragmented disks; they don't defragment before they start reading/writing files! What we have now is something approaching a real-world test with real-world settings. I don't believe for a moment it's perfect, though, and I hope people will develop it for better results.
Quote
When testing a buffered approach (heap buffer, file mapping), make sure to test a new file or flush the filesystem cache. Note that file mapping has no chance of being uncached; it will always cache.
That's why in my test I make sure the copy files are written unbuffered. All 80 test files are new on every run and should be equally uncached for the four tests.
Ian_B
Quote
That's why in my test I make sure the copy files are written unbuffered. All 80 test files are new on every run and should be equally uncached for the four tests.
Good, sorry, I did not look completely into your test methods :(
Still, fragmentation can have a big effect and punish one method needlessly. If you use a defragmented drive and reserve the file size before the test (a simple SetFilePointer should be sufficient, and it is fast on NTFS), the raw test will be fairer.
Quote
VlasRad, there are many variables. The point of Michael's original test and my follow-up was to compare the different times of loading a file that a small, normal program might use.
Yes, the raw speed of the routine is not the only important factor; you need to consider the environment and how it is typically used. Unbuffered + async is good for massive file hashing/checksumming, but not necessarily the best for "normal" use.
Quote
Most normal programs have to cope with fragmented disks; they don't defragment before they start reading/writing files!
Yes, but if you want to measure the speed of the actual routines and not of a fragmented filesystem, you need to defragment the file. Of course it might be interesting to test the routines on fragmented files, but then you need exactly the same fragmented file each time, and no filesystem cache.
Quote
What we have now is something approaching a real-world test with real-world settings. I don't believe for a moment it's perfect, though, and I hope people will develop it for better results.
Please do test CPU usage while reading the file; it is another important factor. In the end you should get similar task completion times, because the task is I/O bound by the slow hard drive, but CPU usage is not bounded like that. Page faults are expensive, and you get many of them when using a mapped file.
Will these various I/O routines work with a RAM disk? (such as RAMDisk from Cenatek)
If so, how will the routines perform, and what would affect performance?
I don't have one, so I couldn't test it.
There are also hardware memory-based drives, such as the Rocket Drive (also from Cenatek) and the i-RAM from Gigabyte, which are PCI cards with sticks of RAM to hold the data.
I see no reason they would not work with a RAM disk or i-RAM, but the performance difference would likely be hard to measure then.
But CPU usage might be more interesting with that kind of drive, since it is not I/O bound.
Errm if you cared...
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 -1
filecmp_filemapping: 1 0 0 0 -1 -1 -1
filecmp_heapbuffers: 1ms
filecmp_filemapping: 1ms
100.0%
Press any key to exit...
Seems to be no difference on my system.
2.8GHz Opteron 144, 250GB MaXLine 3 HD, XP SP2
filecmp4, P4 2.8 HT/160 HDD seagate/512 RAM
1st run 2nd run 3rd run
correct return vals: 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1 1 0 0 0 -1 -1 1
filecmp_nobuffering: x x x x x x 1 x x x x x x 1 x x x x x x 1
filecmp_nobuffering_async: x x x x x x 1 x x x x x x 1 x x x x x x 1
filecmp_heapbuffers: 623ms 662ms 632ms
filecmp_filemapping: 307ms 306ms 279ms
49.3% 46.2% 44.1%
filecmp_nobuffering: 6524ms 8434ms 5928ms
filecmp_nobuffering_async: 12202ms 13022ms 9108ms
After the 1st/2nd time I ran filecmp4, my computer struggles to revive and lags terribly for about 10 secs.
3rd run: after I switched Avast AV/Tiny Personal FW off