
FILE_FLAG_NO_BUFFERING usage.

Started by KeepingRealBusy, April 15, 2010, 06:59:08 PM


KeepingRealBusy

I am processing some huge files, and want to speed up the process. I have already allocated huge buffers with VirtualAlloc.

Exactly how does this work when reading a file that is not mod sector size long? Should you truncate the read to mod 512, read the data, then read the partial sector - assuming that the system will buffer that partial block and move it to your buffer after reading? Or should you just read the whole thing and the system will read the sector sized pieces without buffering, and then read the last partial sector and move it to the buffer?

How do you write the data back out if you create the file with FILE_FLAG_NO_BUFFERING considering the final short sector?

In general, if you specify FILE_FLAG_NO_BUFFERING, but do I/O with something that is not sector sized, does the system (Windows XP) ignore the FILE_FLAG_NO_BUFFERING and buffer the whole I/O, or does it do as much as possible without buffering, and then process the final piece as buffered and moved?
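For the read side, this is the kind of thing I have in mind (untested C sketch for brevity; the file name and request size are placeholders, and it assumes that an unbuffered read reaching end of file returns a short byte count, which is exactly what I am asking about):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD spc, bytesPerSector, freeClus, totalClus;
    GetDiskFreeSpaceA("C:\\", &spc, &bytesPerSector, &freeClus, &totalClus);

    HANDLE hFile = CreateFileA("huge.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    DWORD request = 2 * 1024 * 1024;          /* placeholder request size   */
    request -= request % bytesPerSector;      /* keep it a sector multiple  */

    BYTE *buf = (BYTE *)VirtualAlloc(NULL, request,
                                     MEM_COMMIT | MEM_RESERVE,
                                     PAGE_READWRITE);   /* page aligned     */
    DWORD got;
    for (;;) {
        /* Always request a whole number of sectors; at end of file the
           call should succeed with 'got' holding only the bytes that exist. */
        if (!ReadFile(hFile, buf, request, &got, NULL) || got == 0)
            break;
        printf("read %lu bytes\n", got);      /* process buf[0..got-1] here */
    }
    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(hFile);
    return 0;
}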

Dave.

redskull

I believe that if you read or write a block that is not a multiple of the sector size, the read/write will fail.  Technically, every file occupies a whole number of sectors (indeed, a whole number of clusters) on disk.  Personally, I would recommend you use section objects for such a task, but to each his own.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

clive

Consider MapViewOfFile
http://msdn.microsoft.com/en-us/library/aa366761%28VS.85%29.aspx

The file system only pins memory pages (4K) to the pieces of the file you actually touch.

Also consider how you are doing file I/O. If you can pull in blocks of, say, 32K, sector or cluster aligned, things will work more effectively. If your records are not nicely sized/aligned, you are better off loading a larger aligned buffer and computing an offset into that buffer where your data lives. If you change data, write out the whole block (in 512-byte units). This saves the file system the work of stitching the data back into sectors for you.

Another trick is to know what your file pointer is, and only tell the file system to move it when it is not in the right position.
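A quick sketch of the mapping approach, in C for brevity (the 16 MB window and the file name are placeholders; view offsets must be multiples of the 64 KB allocation granularity that GetSystemInfo reports):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);                        /* si.dwAllocationGranularity */

    HANDLE hFile = CreateFileA("huge.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(hFile, &size);

    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);

    ULONGLONG window = 16 * 1024 * 1024;       /* multiple of 64 KB          */
    ULONGLONG sum = 0;
    for (ULONGLONG off = 0; off < (ULONGLONG)size.QuadPart; off += window) {
        SIZE_T len = (SIZE_T)min(window, (ULONGLONG)size.QuadPart - off);
        BYTE *view = (BYTE *)MapViewOfFile(hMap, FILE_MAP_READ,
                                           (DWORD)(off >> 32), (DWORD)off,
                                           len);
        if (!view) break;
        for (SIZE_T i = 0; i < len; i++)       /* touching the data faults   */
            sum += view[i];                    /* pages in 4 KB at a time    */
        UnmapViewOfFile(view);
    }
    printf("byte sum %llu\n", sum);
    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}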

-Clive
It could be a random act of randomness. Those happen a lot as well.

jj2007

Quote from: KeepingRealBusy on April 15, 2010, 06:59:08 PM
I am processing some huge files, and want to speed up the process. I have already allocated huge buffers with VirtualAlloc.

That sounds like you are allocating Gigabytes ... most probably, there is an optimal size for that buffer well below the GB range.

The optimum may depend, inter alia,
1. on the overhead (positioning the file pointer inside the block...),
2. on your disk cache (whether the files are already cached or not),
3. on your CPU cache(s).
The latter depends on how you process the data. If it's only one pass, a big buffer (>L2 cache size) may be appropriate; otherwise not.

Try various sizes, and time it. Since you are worried about real life performance, times will be in the seconds (?) range, so GetTickCount is precise enough.
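Something like this would do for a first test (C sketch, untested; the file name and the size list are placeholders):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    static const DWORD sizes[] = { 64*1024, 256*1024, 1024*1024,
                                   16*1024*1024, 256*1024*1024 };
    for (int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        HANDLE h = CreateFileA("huge.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;
        BYTE *buf = (BYTE *)VirtualAlloc(NULL, sizes[i],
                                         MEM_COMMIT | MEM_RESERVE,
                                         PAGE_READWRITE);
        DWORD got, t0 = GetTickCount();
        while (ReadFile(h, buf, sizes[i], &got, NULL) && got)
            ;                                  /* processing would go here */
        printf("%9lu bytes/request: %lu ms\n",
               sizes[i], GetTickCount() - t0);
        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
    }
    return 0;
}

Mind the disk cache: the second and later runs of the same file will be much faster than the first.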

redskull

If you explain more about what "processing" you are doing, there is probably an approach best suited to it; are you reading mostly sequentially and writing a new file, reading randomly and altering the same file, or reading only and generating a modest result (something derived from the processed data, a checksum, etc.)?  I guess a better question is "why do you think turning off buffering will be better?"

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

MichaelW

There is some related information here, see posts by Ian_B:

http://www.masm32.com/board/index.php?topic=4768.0

eschew obfuscation

jj2007

Quote from: MichaelW on April 15, 2010, 10:54:53 PM
There is some related information here, see posts by Ian_B:

http://www.masm32.com/board/index.php?topic=4768.0


Interesting. Timings for this post:
Reading test file...
Creating comparison files...

Run 1:
filecmp_heapbuffers: 719ms    <<<<<<<<<<< much lower for first run
filecmp_filemapping: 659ms - 91.7%
filecmp_nobuffering: 662ms - 92.1%
filecmp_nobuffering_async: 674ms - 93.7%

Run 2:
filecmp_heapbuffers: 1170ms
filecmp_filemapping: 668ms - 57.1%
filecmp_nobuffering: 659ms - 56.3%
filecmp_nobuffering_async: 674ms - 57.6%

Run 5:
filecmp_heapbuffers: 1185ms
filecmp_filemapping: 670ms - 56.5%
filecmp_nobuffering: 679ms - 57.3%
filecmp_nobuffering_async: 652ms - 55.0%

KeepingRealBusy

JJ,

Thank you for the link. Ian_B's response on page 2 was what I needed to know about how to write a partial sector. That leaves a question: do I really need to close and reopen the file to write the final partial sector? Can I just open two handles to the same file (create one with no buffering, then open a second one with buffering)? I would have to seek the second, buffered handle to the end of the written data before the write.

What about reading? Do I just round up the read size to the next sector boundary, and will the system simply return a short BytesRead value for an un-buffered read?
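Something like the following is what I have in mind for the write side (untested C sketch; it closes the unbuffered handle and reopens with buffering rather than holding two handles at once, which is exactly the open question; names and sizes are placeholders):

#include <windows.h>

int main(void)
{
    const DWORD sector = 512;                  /* from GetDiskFreeSpace     */
    DWORD total = 1000000;                     /* bytes we want on disk     */
    DWORD whole = total & ~(sector - 1);       /* sector-multiple portion   */
    DWORD written;

    BYTE *buf = (BYTE *)VirtualAlloc(NULL, total, MEM_COMMIT | MEM_RESERVE,
                                     PAGE_READWRITE);

    /* Bulk of the data: unbuffered, page-aligned buffer, sector-multiple
       length. */
    HANDLE h = CreateFileA("out.dat", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING, NULL);
    WriteFile(h, buf, whole, &written, NULL);
    CloseHandle(h);

    /* Tail: reopen with normal buffering, seek to the end of what was
       written, and append the final partial sector. */
    h = CreateFileA("out.dat", GENERIC_WRITE, 0, NULL, OPEN_EXISTING, 0, NULL);
    SetFilePointer(h, whole, NULL, FILE_BEGIN);
    WriteFile(h, buf + whole, total - whole, &written, NULL);
    CloseHandle(h);

    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}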

For some of the other questions asked, here are the particulars.

I tried to sort a file of 500 million text strings, each entry a 10 digit incrementing number (this will later be randomized, but for now the file is sequential). I used Windows sort. It took 45 minutes, and then about 4 more minutes to write the output file. I thought I could do better.

My initial stab was to read the file into an approximately 2GB (VirtualAlloc) buffer, piece by piece. I convert the pieces from (string, CRLF) into (WORD size, string) and write them to another file. Along the way I calculate the maximum string length (I want a general solution) and the record count. This conversion takes about 2 seconds for a 2GB buffer (trivial), the read about 4 minutes, and the write the same.

I then read that file back in and create a fixed-length record containing (WORD record size), (WORD string size), string, padded with zeroes to a total length that is a multiple of 16, and write a third file. Next I read the fixed-length records into the 2GB buffer, piece by piece, and sort them with my optimized Quicksort, writing the sorted data back on top of the unsorted data and tracking the file start and size of each of these sorted blocks. There are two size fields because the Quicksort compares with DWORDs and exchanges with XMM regs; it compares only up to the largest string size and exchanges only the largest record size, which keeps this a generic process that allows binary data to follow or precede the key string field.

When all data is sorted, I read the data back into small buffers, one buffer for each such sorted block. The buffers are connected in a linked list that holds each block's seek position and size and the buffer location of its current record, and the list is kept in record sequence. The lowest record is written out to the output file as a string again (less the control fields and padding, and with a CRLF); the new top record of the first buffer is compared to the top record of the second buffer (only one compare is needed), and the buffer is unlinked and re-linked into the list if its top record is no longer the lowest. This continues, writing, skipping to the next record, unlinking, refilling, and re-linking as the buffers empty, until all buffers are exhausted.
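A cut-down sketch of that merge step, in C with in-memory blocks so it runs stand-alone (in the real program each block also carries the seek position and size of its sorted region and is refilled from disk when it runs dry; the names are illustrative, not my actual code):

#include <stdio.h>
#include <string.h>

#define KEYLEN 10                       /* fixed key width, as above       */

typedef struct Block {
    const char (*recs)[KEYLEN + 1];     /* sorted records of this block    */
    int count, cur;                     /* total and current record index  */
    struct Block *next;                 /* list kept ordered by head key   */
} Block;

/* Pop the head block, then re-insert it so the list stays ordered by the
   key of each block's current record (one compare per step in the common
   case). */
static Block *relink(Block *head)
{
    Block *b = head, *p, *list = head->next;
    if (b->cur >= b->count) return list;          /* block exhausted       */
    if (!list || strcmp(b->recs[b->cur], list->recs[list->cur]) <= 0) {
        b->next = list; return b;                 /* still the lowest      */
    }
    for (p = list; p->next && strcmp(b->recs[b->cur],
             p->next->recs[p->next->cur]) > 0; p = p->next) ;
    b->next = p->next; p->next = b;
    return list;
}

int main(void)
{
    static const char a[][KEYLEN + 1] = { "0000000001", "0000000004" };
    static const char b[][KEYLEN + 1] = { "0000000002", "0000000003" };
    Block blks[2] = { { a, 2, 0, &blks[1] }, { b, 2, 0, NULL } };
    Block *head = &blks[0];

    while (head) {                      /* emit records in merged order    */
        printf("%s\r\n", head->recs[head->cur]);
        head->cur++;                    /* advance this block...           */
        head = relink(head);            /* ...and restore the list order   */
    }
    return 0;
}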

The problem is the multiple handling of the data: one pass to read the text strings, one pass to write the structures (the entire file must be processed before the fixed length is known), one pass to read the structures, one pass to write the fixed-length data, one pass (fragmented) to read the fixed-length data, one pass (fragmented) to write it back on top of the unsorted data, one pass (really fragmented) to read it into the small merge buffers, and one pass (extremely fragmented) to write out each record as a text string again.

The reason for the un-buffered I/O is that the system reads the data into its own buffer and then moves it to your buffer, doubling the data movement time. Better to read it directly into your buffer to begin with.

There are many other optimizations already included, such as side-buffering the output data (the structures, the fixed-length data, and the output text strings) and writing the data as large blocks; there is about 1MB of additional buffers for this side buffering, fragmented into 4 blocks and not contiguous with the huge buffer. I just want to peel this onion one smelly layer at a time, starting with non-buffered I/O.

Dave.

KeepingRealBusy

MichaelW,

Sorry, didn't mean to slight you. I re-read the posts and noticed that it was you who gave the link; JJ was just quoting.

Thank you again for the link.

Dave.

clive

Quote from: clive
If you can pull in blocks of say 32K, sector or cluster aligned, things will work more effectively.

I have FSD experience, but in case it needs to be demonstrated: the sweet spot for reading blocks into memory is generally 32K/64K across most Microsoft OSs I've played with. Here is a 53 MB file processed at assorted blocking sizes, which I think is cached. YMMV; benchmark accordingly.

       512 CCITT CRC-32 3A4BA410 (54425496 Bps)
      1024 CCITT CRC-32 3A4BA410 (60106272 Bps)
      2048 CCITT CRC-32 3A4BA410 (60106272 Bps)
      4096 CCITT CRC-32 3A4BA410 (62345388 Bps)
      8192 CCITT CRC-32 3A4BA410 (63453422 Bps)
     16384 CCITT CRC-32 3A4BA410 (64601554 Bps)
     32768 CCITT CRC-32 3A4BA410 (65954049 Bps)
     65536 CCITT CRC-32 3A4BA410 (65954049 Bps)
    131072 CCITT CRC-32 3A4BA410 (64679575 Bps)
    262144 CCITT CRC-32 3A4BA410 (64679575 Bps)
    524288 CCITT CRC-32 3A4BA410 (63453422 Bps)
   1048576 CCITT CRC-32 3A4BA410 (63453422 Bps)
   2097152 CCITT CRC-32 3A4BA410 (63453422 Bps)
   4194304 CCITT CRC-32 3A4BA410 (63528693 Bps)


-Clive
It could be a random act of randomness. Those happen a lot as well.

jj2007

Dave,
Just to get a better understanding:

> I tried to sort a 500 million text string file, each entry a 10 digit incrementing number
So each string has the format "123456", CrLf$? What is the maximum number represented? Would it fit into a DWORD?

> My initial stab was to read the file into an approximately 2GB (VirtualAlloc) buffer
500,000,000*(10+2)=6GB - how does that work with a VirtualAlloc of only 2GB?

>I used Windows sort.
45 minutes, wow...! Which API is that?

My tests with Ian's algos say that the reading step does not differ wildly for various buffering options:
Prescott:
filecmp_heapbuffers: 241ms
filecmp_filemapping: 235ms - 97.5%
filecmp_nobuffering: 215ms - 89.2%
filecmp_nobuffering_async: 218ms - 90.5%
FilesDiffer (MasmBasic - SSE2): 210ms - 87.1%

All algos do a byte-for-byte comparison, but while the first 4 compare against a fixed buffer, FilesDiffer also loads the comparison buffer on every loop. So the comparison seems to matter much more than the loading.

redskull

It's all the multiple reads and writes that are killing you; buffering will not have nearly as much effect as redesigning the program to minimize the number of passes that need to be done.  First, since we are just talking about numbers, it's much better to convert each one to its hex equivalent instead of keeping the string representation itself; then store the numbers in a "fancy" data structure, probably a B-tree of some style.  You can map different sections of the input file a 'chunk' at a time, read through it, convert it, place it in the tree, and repeat.  Then you open the output file and do the whole thing in reverse.

The problem is that even with numbers instead of strings, the entire B-tree won't fit in RAM (even with AWE).  However, because it's always sorted, you can create a sort of 'index' to it and write chunks of it out to disk.  Keep a secondary structure in memory which maps the minimum and maximum value of each part of the tree to its location on disk, so when you need to add another value to the tree, you can just map a different part of the 2nd disk file.
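A minimal sketch of such an index (C for illustration; the names and numbers are made up):

#include <stdio.h>

typedef unsigned long long u64;

typedef struct {
    u64 minKey, maxKey;   /* smallest/largest key stored in the chunk  */
    u64 fileOffset;       /* byte offset of that chunk in the 2nd file */
} ChunkIndex;

/* Return the index slot whose range covers 'key', or -1 if none does;
   the caller would then map/read that chunk of the tree and insert. */
static int find_chunk(const ChunkIndex *idx, int n, u64 key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {                  /* ranges are kept sorted/disjoint */
        int mid = (lo + hi) / 2;
        if (key < idx[mid].minKey)      hi = mid - 1;
        else if (key > idx[mid].maxKey) lo = mid + 1;
        else return mid;
    }
    return -1;
}

int main(void)
{
    ChunkIndex idx[] = { { 0, 999, 0 }, { 1000, 4999, 65536 },
                         { 5000, 9999, 131072 } };
    printf("key 4567 is in chunk %d\n", find_chunk(idx, 3, 4567));
    return 0;
}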

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

KeepingRealBusy

Thank you all for the replies and hints.

I have been inserting (in the program) timing output lines to the console for all reads, writes, converts, sorts, etc. I will save the execution log, then as I make changes, I will save the new log and check whether the change improves the situation. If anyone is interested, I can post the initial timings. Right now I have been debugging an apparent hang in the sort, which turns out not to be a hang: the sort is peeling off the last line with each pass, so there are only 499,999,999 passes to go. This may not be an error, just an anomaly of the way QuickSort works.

Clive,

Quote
I have FSD experience, but in case it needs to be demonstrated, the sweet spot for reading blocks into memory is generally 32K/64K across most Microsoft OS's

I figured that out long ago; my file I/O wrappers always split the file I/O into 64KB requests and repeat until the request is satisfied.
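Roughly this shape, sketched here in C for illustration (the names are made up, not my actual wrapper code):

#include <windows.h>
#include <stdio.h>

#define IO_CHUNK (64 * 1024)

/* Returns the number of bytes actually read (short only at end of file). */
static DWORD ReadSplit(HANDLE hFile, void *dest, DWORD total)
{
    BYTE  *p = (BYTE *)dest;
    DWORD  done = 0;
    while (done < total) {
        DWORD want = total - done, got = 0;
        if (want > IO_CHUNK) want = IO_CHUNK;         /* cap each request */
        if (!ReadFile(hFile, p + done, want, &got, NULL) || got == 0)
            break;                                    /* error or EOF     */
        done += got;
    }
    return done;
}

int main(void)
{
    HANDLE h = CreateFileA("huge.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;
    BYTE *buf = (BYTE *)VirtualAlloc(NULL, 8 * 1024 * 1024,
                                     MEM_COMMIT | MEM_RESERVE,
                                     PAGE_READWRITE);
    printf("got %lu bytes\n", ReadSplit(h, buf, 8 * 1024 * 1024));
    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}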

JJ,

Quote
how does that work with a VirtualAlloc of only 2GB?

I just fill my big buffer multiple times, turns out to be 3 times for the input file.

Quote
What is the maximum number represented?

The strings are 10 digits, so they can hold up to 9,999,999,999. My file has 500,000,000 lines.

Quote
45 minutes, wow...! Which API is that?

My 45 minute sort was Windows Sort.exe, not an API.


Quote
My tests with Ian's algos say that the reading step does not differ wildly for various buffering options:

The filecmp executions will be swamped by the compare times, so they really do not address the file I/O differences.

redskull,

In general, I was trying to use my optimized QuickSort. If that does not improve the execution time, then I may drop back into my b++ tree sort.


Quote
First, since we are just talking about numbers, it's much better to convert it to the hex equivalent, instead of the string

I did not want to convert the numbers to binary because my intent was to make this a freestanding string sort executable.

Dave