
File Compare

Started by MichaelW, May 12, 2006, 03:57:22 PM


MichaelW

Thanks Ian, I was hoping you would take a look at the code and comment on it. The truncation, the choice of VirtualAlloc, and the fixed sector size were all just sloppy shortcuts to save time. My focus was on reasonably verifying that file mapping is the best method for this application. I was assuming that the normal sector size is still 512 bytes; is this no longer so?

I modified the filecmp_nobuffering and filecmp_nobuffering_async procedures to allocate from the heap and calculate and use sector-aligned start addresses for the buffers. As a matter of convenience I am still truncating the files to a multiple of the sector size, but given the size of the test file this should not have any significant effect. On my system filecmp_nobuffering is a little faster and filecmp_nobuffering_async much slower. I have replaced the last attachment with the new version.
eschew obfuscation

Ian_B

#16
OK, this is bugging me. I've rewritten the non-async no_buffering code to strip out all the slow shifts (the sector-alignment can be done with ANDs, using a MOD value made from the sector-size minus 1) and I still can't get the elapsed time down from nearly 6000ms, compared to the simple heap code at 1400ms. That's even with the sector-size check done outside the loop. I can't believe those few extra instructions are causing such an enormous slowdown; apart from those it's doing exactly the same as the simple code: allocating memory twice, reading 2 files into memory and doing exactly the same compare.  :eek

However... logic is now returning, and it's telling me that the reason may be the simple fact of reading with FILE_FLAG_NO_BUFFERING. If you do 100 reads in a loop WITHOUT that flag, the kernel will cache the file, and 99 of those times it will not read from disk but will fetch from memory. And when you test the filemapping code straight after, it doesn't even have to read from disk once!  :bdg  Whereas the no-buffered version almost certainly has no such luck. Testing with multiple repeat reads of the same file is not going to give a clear answer as to which may be faster in general usage, where you tend to read a file only once.  :wink
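
To show what I mean, here's a minimal sketch of the two open modes being compared (testFile and the two handle names are just placeholders, not names from Michael's code):

    ; buffered open: reads go through the system cache, so repeat reads
    ; of the same file are served from memory
    invoke CreateFile, ADDR testFile, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
    mov hFileBuffered, eax

    ; unbuffered open: every read goes to the disk, but the buffer address,
    ; file offset and read length must all be sector-aligned
    invoke CreateFile, ADDR testFile, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL
    mov hFileUnbuffered, eax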

Maybe for a better test, copy the file 40 times and rename the copies with a split number. (This could be automated in the app set-up; just make sure the files are written no-buffered!! Don't forget to delete them afterwards...) Then either put all the names in an array and work through them using a pointer that advances by the length of the name, or use one name and update the split number in it between tests. That way you can do 10 tests per proc (comparing against the original file), and every test gets a different but identical file that won't be cached. Doing it this way also means you can simplify the code to read in only the test file; you may as well keep the original in memory throughout the tests.
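
A rough sketch of generating the split names (the name pattern, buffer and loop count here are only placeholders, and the no-buffered copying itself is left as a comment):

    .data
    fmtName   db "testfile.%03u", 0     ; placeholder split-name pattern
    .data?
    nameBuf   db MAX_PATH dup(?)
    .code
    ...
    mov   esi, 1
make_copies:
    invoke wsprintf, ADDR nameBuf, ADDR fmtName, esi
    ; open nameBuf with FILE_FLAG_NO_BUFFERING and write the copy of the
    ; original file here, so none of the copies ends up in the cache
    inc   esi
    cmp   esi, 40
    jbe   make_copies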

For interest, here's my version of the sector-alignment code; it works identically for rounding up the memory allocation start point. Note that there was an error in your code: you released the wrong memory pointers in the no-buffering section (the rounded values, not the memBase values), and in the two no-buffering timer loops there's a superfluous PUSH EAX.

    invoke GetDiskFreeSpace,NULL,    ; inside or outside loop
                            ADDR junkreturn,
                            ADDR bytesPerSector,
                            ADDR junkreturn,
                            ADDR junkreturn

...

    mov ebx, bytesPerSector
    sub ebx, 1              ; make MOD value from power of 2 returned

    ; Get file size and extend to multiple of sector size

    invoke GetFileSize,hFile1,NULL
    mov ecx, ebx
    mov lof1, eax           ; save real length of file, use for compare function
    test eax, ebx           ; is there a remainder
    jz  @F

    not ecx
    add eax, ebx            ; add sector-length - 1
    and eax, ecx            ; round up to next sector-length
@@:
    mov secbytes1, eax      ; save correct sector-multiple read length
; you need this value as an extra LOCAL, must be same for second file

    invoke GetFileSize,hFile2,NULL
    cmp lof1, eax
    jne quit_nobuffer
; but you can save yourself lof2 as this must be identical to lof1

    ; Allocate memory and calc sector-aligned start address

    invoke GetProcessHeap
    mov ecx, secbytes1
    xor edx, edx
    add ecx, ebx            ; max needed memsize is rounded length + (sector-1)
    push ecx                ; set params now for next HeapAlloc call
    push edx
    push eax
    invoke HeapAlloc,eax,edx,ecx
    mov memBase1, eax
    mov ecx, ebx
    test eax, ebx           ; is mem allocation sector-aligned
    jz  @F

    not ecx
    add eax, ebx            ; add sector-length - 1
    and eax, ecx            ; round up to next sector-multiple start
@@:
    mov lpMem1, eax
    call HeapAlloc          ; params already on stack
    mov memBase2, eax
    mov ecx, ebx
    test eax, ebx           ; is mem allocation sector-aligned
    jz  @F

    not ecx
    add eax, ebx            ; add sector-length - 1
    and eax, ecx            ; round up to next sector-multiple start
@@:
    mov lpMem2, eax

    invoke ReadFile,hFile1,lpMem1,secbytes1,ADDR bytesRead,NULL
    invoke ReadFile,hFile2,lpMem2,secbytes1,ADDR bytesRead,NULL

    invoke cmpmem,lpMem1,lpMem2,lof1    ; whole file compared!

...

    ret

  quit_nobuffer:
    xor ebx, ebx
    jmp @B


Ian_B

MichaelW

Thanks for the improved code, and for finding my mistake of releasing the wrong memory pointers. Now that I look at it, using the rounded-up value for the buffers and the reads, and the actual length for the compare, should have been a no-brainer. I made no real effort to use efficient code because I couldn't see how a few hundred clock cycles could make a significant difference with a 1,127,716-byte test file. You have a good point about the caching effects, but I coded this for an application where the files normally would be cached. I think the most common use for a file comparison would be after performing some operation on the files being compared, a COPY /V for example.

eschew obfuscation

ic2

Just started reading this thread. This is what I get with filecmp4 on a P3 846, 128MB RAM, XP with no service pack, but stripped down with nLite. nLite should not be an issue, I don't think. Is this strange or what? 3ms each...


correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1

filecmp_nobuffering: x x x x  x  x 1

filecmp_nobuffering_async: x x x x  x  x 1

filecmp_heapbuffers: 3ms
filecmp_filemapping: 3ms

100.9%

filecmp_nobuffering: 3ms

filecmp_nobuffering_async: 3ms


ic2

The 100.9% should be 100.0% .... typing error.

ic2

This is what I get with filecmp4 on my Intel Celeron 498 MHz, 384MB RAM, XP with no service pack, but stripped down with nLite just like on the other machine.


correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1

filecmp_nobuffering: x x x x  x  x  -1

filecmp_nobuffering_async: x x x x  x  x  -1

filecmp_heapbuffers: 9ms
filecmp_filemapping: 10ms

111.1%

filecmp_nobuffering: 10ms

filecmp_nobuffering_async: 10ms

P1

P4 2GHz W2KSP4
correct return vals: 1 0 0 0 -1 -1 1
filecmp_heapbuffers: 1 0 0 0 -1 -1 1
filecmp_filemapping: 1 0 0 0 -1 -1 1

filecmp_nobuffering: x x x x  x  x 1

filecmp_nobuffering_async: x x x x  x  x 1

filecmp_heapbuffers: 1639ms
filecmp_filemapping: 3226ms

filecmp_nobuffering: 6701ms

filecmp_nobuffering_async: 11401ms

Press any key to exit...


Regards,  P1  :8)

Mark Jones

Quote from: Ian_B on May 15, 2006, 12:30:43 AM
Maybe for a better test, copy the file 40 times and rename the copies with a split number.

Consider allocating contiguous blocks to eliminate file fragmentation latency.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Ian_B

#23
I wrote a version of Michael's test routine that worked as I suggested, saving 80 copies of windows.inc (unbuffered) and reading each one in as a separate file comparison. It's the only way to avoid caching effects and get a realistic estimate of the different speeds of handling file input. As I suspected, my previous experience was vindicated, and FILE_FLAG_NO_BUFFERING is indeed the fastest way of reading files. It's also the fastest way of writing them, although there's a nasty wrinkle at the end: because you are forced to over-write up to a sector boundary, you have to close and reopen the file without the flag, set the filepointer and truncate the final write back to the real EOF. Having measured the substantial difference over reading/writing almost 900 1Mb files while I was testing my joiner app, I was pretty sure this more modest test would support me.  :toothy  The big surprise is the poor performance of the file-mapping code.
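
For anyone curious, that end-of-write cleanup looks roughly like this sketch (hFileOut, fileName and lofReal are placeholder names for the no-buffered output handle, its path and the real data length; this isn't lifted from my joiner code):

    invoke CloseHandle, hFileOut        ; close the FILE_FLAG_NO_BUFFERING handle
    invoke CreateFile, ADDR fileName, GENERIC_WRITE, 0, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
    mov   hFileOut, eax
    invoke SetFilePointer, hFileOut, lofReal, NULL, FILE_BEGIN
    invoke SetEndOfFile, hFileOut       ; chop off the padding written up to the sector boundary
    invoke CloseHandle, hFileOut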

The attached code strips out everything unnecessary and keeps a single copy of the original file through the test. It is therefore getting an accurate estimate, as Michael originally wanted, of the "cost" of setting up a memory buffer if needed and doing the read into the buffer, and only that. Note that the asynchronous code performs pretty well, although this is a really trivial example of its use, reading in an entire file at once! To make the extra overhead of handling events and overlapped structures worthwhile, it's most generally used for dealing with chunks of a file, so that you can process one data chunk while the next few chunks are being read in. In the ideal case the processing thread(s) never have to wait for further reads to complete to get new data to process, or wait for any writes of the processed data to complete before starting a new chunk.

Michael, I understand your view of the caching working in your favour depending on what you want to do subsequently with the file, but the point of your original test was to compare read times, and non-buffering is provably the fastest way of doing that for the simple examples in the library or for most other applications where the important thing is simply to get the data in memory for some sort of processing.

Reading test file...
Creating comparison files...

filecmp_heapbuffers: 554ms

filecmp_filemapping: 687ms - 124.0%

filecmp_nobuffering: 528ms - 95.3%

filecmp_nobuffering_async: 537ms - 96.9%

Press ENTER to exit...


Make sure you alter the drive location of the test file before testing. There may be a few easily-fixed wrinkles when assembling, due to small personalised variations in my MASM setup.

Ian_B

EDIT! Minor bugfix, doesn't affect functionality/timing, just cleanup for unmatched stack pushes on error.


[attachment deleted by admin]

six_L

#24
Quote
Reading test file...
Creating comparison files...

filecmp_heapbuffers: 1084ms

filecmp_filemapping: 1027ms - 94.7%

filecmp_nobuffering: 1128ms - 104.1%

filecmp_nobuffering_async: 993ms - 91.6%


Press ENTER to exit...
;;;
After the minor bugfix
;;;
Quote
Reading test file...
Creating comparison files...

filecmp_heapbuffers: 1252ms

filecmp_filemapping: 1015ms - 81.1% 

filecmp_nobuffering: 1004ms - 80.2%

filecmp_nobuffering_async: 1453ms - 116.1%

Press ENTER to exit...

regards

Ian_B

Guess the variation might be platform-dependent - my system is a P4 Northwood; what's yours, six_L?

six_L

XP Home SP2
Intel(R) Pentium(R) 1.4GHz
regards

MichaelW

Ian,

You have convinced me. At least on most recent systems, non-buffering is the fastest method of moving disk data that is not in the cache into a memory buffer.

Thinking about your suggestion of using multiple test files, it occurred to me that rotating such a small number of copies would be unlikely to eliminate all caching effects because the system cache is typically large, greater than 300MB on my 512MB system, for example.

For the asynchronous procedure and large files, might there be some benefit to breaking up the reads into smaller chunks, comparing each pair of chunks while the next pair was being read?

Results on my P3-500, Windows 2000 SP4 system:

filecmp_heapbuffers: 1016ms

filecmp_filemapping: 821ms - 80.8%

filecmp_nobuffering: 728ms - 71.7%

filecmp_nobuffering_async: 724ms - 71.3%

eschew obfuscation

Ian_B

#28
Quote from: MichaelW on May 16, 2006, 06:58:25 AM
You have convinced me. At least on most recent systems, non-buffering is the fastest method of moving disk data that is not in the cache into a memory buffer.

I wouldn't like to say it's cut and dried, especially from the small set of results posted here so far. It definitely seems to be so on my puter; that's all I'm sure of now. Having said that, running the test repeatedly shows a wide variation in results: although the initial heapbuffering is remarkably consistent in absolute timing for me and the filemapping is always slower, the nobuffering ranges from 95% to 100% and the async code has reached 125% in one run. This really isn't a final "proof" of the relative merits of these different approaches, but it's a good indicator and a start if someone wants to make further experiments.

Quote
Thinking about your suggestion of using multiple test files, it occurred to me that rotating such a small number of copies would be unlikely to eliminate all caching effects because the system cache is typically large, greater than 300MB on my 512MB system, for example.

True. But the odd thing about your original test was that the non-buffering flag even seemed to be forcing the read to ignore the cached data where it existed, which was why your non-buffering examples compared so appallingly. So I'm reasonably sure that by writing the test files non-buffered we should eliminate all pre-caching and keep the test fair. And the fact that the results are so much closer together in this test tends to support that theory.

Quote
For the asynchronous procedure and large files, might there be some benefit to breaking up the reads into smaller chunks, comparing each pair of chunks while the next pair was being read?

Exactly how it should be done to make the most of asynchronous I/O. When I was first introduced to it, my ASM mentor described his own experiments in testing speed against other apps, and he was sure from his testing that the optimum data chunk size for async reading was 64Kb, as this was an internal buffer size for Windows. I've used that size in my own code successfully (my 2Mb read/write buffer has an array of 32 extended overlaps to control it), but I can't definitely confirm whether it is still the best. Obviously, setting up that amount of overhead is something best done once at startup rather than on an ad-hoc basis every time you need to read a file. To help the management code, so I know what should have been read and whether I have a result pending in the overlap, I extend my overlaps like this (plus, at 8 DWORDs each, the records make it easy to wrap from the end of the overlap array back to the beginning using a MOD value):

OVERLAPPED_PLUS STRUCT
                        OVERLAPPED <> ; as normal OVERLAPPED plus extra 3 DWORDs
; OVERLAPPED STRUCT
;   Internal            DWORD ?
;   InternalHigh        DWORD ?
;   OffsetLow           DWORD ?   ; this is better naming!
;   OffsetHigh          DWORD ?
;   hEvent              DWORD ?
; OVERLAPPED ENDS

  Transferred           DWORD ?   ; this is the output variable for the read/write call
  Requested             DWORD ?   ; this should match after a read/write call
                                  ; top bits used as flags to show last block in file
                                  ; and overlap operation pending for this block
  BufferAddress         DWORD ?   ; pointer to the 64K buffer this overlap controls
OVERLAPPED_PLUS ENDS


Without going into multi-threading, the point of asynchronous I/O is to eliminate all the waiting for reads/writes to take place. In this case, by the time you have finished comparing the first chunk the rest of the data should be stacking up nicely in the later sub-buffers ready to test. This is a perfect use of async code, where the processing of one chunk is not dependent on other parts of the file. Zipping, encoding/decoding, CRCing, doing MD5s etc. are all prime candidates for this approach where there is a lot of processing that can be done while you'd otherwise be sitting waiting for the reads to supply new data.
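
As a rough illustration only (not my joiner code), assuming an olap array of 32 of the OVERLAPPED_PLUS records above, an hFile opened with FILE_FLAG_OVERLAPPED, and a placeholder ProcessChunk routine, the core loop would look something like this, with EOF and error handling left out:

    xor   edi, edi                      ; byte offset of current record in olap[]
next_chunk:
    lea   esi, olap[edi]                ; esi -> current OVERLAPPED_PLUS
    lea   eax, [esi].OVERLAPPED_PLUS.Transferred
    invoke GetOverlappedResult, hFile, esi, eax, TRUE   ; wait for this chunk's read to finish
    invoke ProcessChunk, [esi].OVERLAPPED_PLUS.BufferAddress, [esi].OVERLAPPED_PLUS.Transferred
    ; re-issue a ReadFile into this record's 64Kb buffer for a later slice of
    ; the file here, then step to the next record, wrapping with an AND
    add   edi, SIZEOF OVERLAPPED_PLUS   ; 32 bytes per record
    and   edi, 1023                     ; the MOD value: 32 records of 32 bytes wrap at 1024
    jmp   next_chunk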

Ian_B

Mark Jones

Quote from: AMD XP 2500+ / XP Pro SP2 / nVIDIA Chipset
Reading test file...
Creating comparison files...

filecmp_heapbuffers: 743ms

filecmp_filemapping: 1133ms - 152.5%

filecmp_nobuffering: 514ms - 69.2%

filecmp_nobuffering_async: 457ms - 61.5%
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08