
Timing large file reads

Started by KeepingRealBusy, May 01, 2010, 09:01:03 PM


KeepingRealBusy

I was testing reading a 6,000,000,000 BYTE file (5,000,000,000 ten character
strings).

I allocate the memory block with VirtualAlloc. I open and close the file between
each test of three reads. I thought that the system might have been caching the
files so I created three different files with the same data. I even defragged
the files with JKDefreg (the log reports that the three files are all single
fragment). My read routine splits any request into multiple requests of 64KB or
less.
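As a rough illustration of the read routine described above (a hypothetical Python sketch, not the actual MASM code; `read_all` and its parameters are made-up names), splitting one large request into pieces of at most 64KB looks like this:

```python
CHUNK = 64 * 1024  # cap each individual request at 64 KB

def read_all(f, total):
    """Read up to `total` bytes from file object f, splitting the request
    into multiple requests of 64 KB or less."""
    parts = []
    remaining = total
    while remaining > 0:
        piece = f.read(min(CHUNK, remaining))
        if not piece:
            break                      # hit end of file early
        parts.append(piece)
        remaining -= len(piece)
    return b"".join(parts)
```

Each underlying read request then stays small regardless of how large the caller's request is.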


The first set of reads is with FILE_FLAG_NO_BUFFERING+FILE_ATTRIBUTE_TEMPORARY
The second set of reads is with FILE_ATTRIBUTE_TEMPORARY
The third set of reads is with FILE_ATTRIBUTE_NORMAL

The current time is: 12:01:46.17
Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:       27.26
Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:     1:01.03
Read strings,                  size =   1,830,992,896 BYTES, the elapsed time is:       28.51

Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:       27.03
Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:     1:03.51
Read strings,                  size =   1,830,992,896 BYTES, the elapsed time is:        2.12

Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:       27.06
Read strings,                  size =   2,084,503,552 BYTES, the elapsed time is:     1:02.64
Read strings,                  size =   1,830,992,896 BYTES, the elapsed time is:        2.21
The current time is: 12:06:48.07


Why does the second read take twice as long as the first? It is the same amount
of data, into the same buffer. And why is the third read back down to 28
seconds?

Why is there no difference between the first reads unbuffered vs. buffered?

Finally, WHY does the third read finish in 2 seconds with both buffered reads?

This was all brought about by trying to speed up a string sort. I wanted to sort
fixed-length strings, so I first read all of the variable-length strings from
one file, converted them to variable-length structures, and wrote them to a
second file, obtaining the length of the longest string in the process. I then
read the structures back from the second file, converted them into fixed-length
strings, and wrote them to a third file. Next I read the fixed-length strings
block by block, sorted each block with Quicksort, and wrote the blocks back out
on top of the original data. Then I read the sorted string blocks into four
smaller buffers defined within the big buffer, refilling each buffer piece by
piece as it emptied. I linked these buffers in ascending sequence on the first
string in each block. Finally I merged the four blocks to the final output
file, converting the fixed-length strings back to text strings.
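The whole pipeline above can be sketched at a high level (a hypothetical Python illustration of the idea, not the MASM implementation; `sort_pipeline` and `block_count` are made-up names): pad each string to the longest length, sort the fixed-length records block by block, then merge the sorted blocks by always emitting the smallest head string.

```python
import heapq

def sort_pipeline(strings, block_count=4):
    """Sketch of the pipeline: variable-length strings -> fixed-length
    records -> per-block sort -> k-way merge -> text strings again."""
    width = max(len(s) for s in strings)
    fixed = [s.ljust(width) for s in strings]      # fixed-length records

    # Split into blocks and sort each block independently (stand-in for
    # Quicksorting each big buffer in turn).
    size = -(-len(fixed) // block_count)           # ceiling division
    blocks = [sorted(fixed[i:i + size]) for i in range(0, len(fixed), size)]

    # Merge the sorted blocks, always taking the smallest head string,
    # just as the four linked buffers are merged to the output file.
    merged = heapq.merge(*blocks)
    return [s.rstrip() for s in merged]            # back to text strings
```

The real program does all of this through disk files and a single large buffer, of course; the sketch only shows the ordering logic.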

The following is the timing for this process. Note: the input is the same input
as for the reading tests above. Note further that the total time is only
23:50.15, while Windows SORT.EXE takes 47:31.48. My process takes about half
the time of SORT.EXE, and this is terrible code (that is why I am trying to
improve it).


The current time is: 15:59:24.95
Initialize sort,               size =               0 BYTES, the elapsed time is:        0.00
Get big memory block,          size =   2,084,503,552 BYTES, the elapsed time is:        0.29
Get small memory blocks,       size =      54,976,512 BYTES, the elapsed time is:        0.00

(Convert variable strings to variable structures.)

Read strings,                  size =   2,084,503,550 BYTES, the elapsed time is:       33.54
Convert strings,               size =   2,084,503,548 BYTES, the elapsed time is:        6.40
Write structures,              size =   2,084,503,548 BYTES, the elapsed time is:       34.93

Read strings,                  size =   2,084,503,550 BYTES, the elapsed time is:       34.57
Convert strings,               size =   2,084,503,548 BYTES, the elapsed time is:        6.28
Write structures,              size =   2,084,503,548 BYTES, the elapsed time is:       34.90

Read strings,                  size =   1,830,992,904 BYTES, the elapsed time is:       28.67
Convert strings,               size =   1,830,992,904 BYTES, the elapsed time is:        5.28
Write structures,              size =   1,830,992,904 BYTES, the elapsed time is:       31.00

(Convert variable structures to fixed strings.)

Read structures,               size =   2,084,503,538 BYTES, the elapsed time is:       32.54
Write fixed,                   size =      54,976,512 BYTES, the elapsed time is:        0.85
    .
    .
    .
Read structures,               size =   2,084,503,538 BYTES, the elapsed time is:       32.75
Read structures,               size =   1,830,992,928 BYTES, the elapsed time is:       29.09
Read structures,               size =              12 BYTES, the elapsed time is:        0.00
Write fixed,                   size =      28,405,760 BYTES, the elapsed time is:        0.51
Total for write fixed,         size =   8,000,000,000 BYTES, the elapsed time is:     2:16.60

(Sorting the blocks.)

Read fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       36.50
Sort fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       34.95
Write sorted,                  size =   2,084,503,360 BYTES, the elapsed time is:     1:58.35

Read fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       38.92
Sort fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       36.48
Write sorted,                  size =   2,084,503,360 BYTES, the elapsed time is:     1:59.11

Read fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       36.12
Sort fixed,                    size =   2,084,503,360 BYTES, the elapsed time is:       36.64
Write sorted,                  size =   2,084,503,360 BYTES, the elapsed time is:     2:01.34

Read fixed,                    size =   1,746,489,920 BYTES, the elapsed time is:       30.59
Sort fixed,                    size =   1,746,489,920 BYTES, the elapsed time is:       32.04
Write sorted,                  size =   1,746,489,920 BYTES, the elapsed time is:       29.10

(Merging the sorted blocks).

Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.09
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.06
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:       11.56
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.01
Write strings,                 size =      54,976,512 BYTES, the elapsed time is:        0.87
    .
    .
    .
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.07
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.04
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:       10.84
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.03
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.01
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        9.87
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        8.96
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        8.93
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:       10.46
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        8.98
Read sorted,                   size =     521,125,840 BYTES, the elapsed time is:        8.98
Read sorted,                   size =     183,112,400 BYTES, the elapsed time is:        3.10
Write strings,                 size =       7,560,192 BYTES, the elapsed time is:        0.12
Total for write strings,       size =   6,000,000,000 BYTES, the elapsed time is:     1:39.74
Total for read sorted,         size =   6,000,000,000 BYTES, the elapsed time is:     2:25.07
The current time is: 16:23:15.10

16:23:15.10
-15:59:24.95
-----------
00:23:50.15     Time to read, sort, report time.



The current time is:  07:43:32.45
\cp\cpsortit\debug>if exist textseq.txt sort.exe textseq.txt /O sorted13.txt
The current time is:  08:31:04.25

08:31:04.25
-07:43:32.45
-----------
00:47:31.48     Time for sort.exe


Note the read, convert, write process. I am reading one file, converting in
place in the buffer, writing the buffer to the second file, then reading the
next block into the buffer and continuing. Note: there is no slowdown during
this read process as there is with the read, read, read pattern in the first
tests (this was all buffered I/O at this point). WHY???

Note the read, sort, write process. I get the same long (double) time to write
the data back on top of the unsorted data in the file.

Aha! Except for the last sequence! The same thing is happening with read, read, read.


    long, long, long, short        and     short, long, short

Does this have anything to do with end of file processing?

Dave.

oex

Quote from: KeepingRealBusy on May 01, 2010, 09:01:03 PM
I was testing reading a 6,000,000,000 BYTE file (5,000,000,000 ten character
strings).

I've only had a quick glance but that doesn't seem to add up :lol
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

hutch--

KeepingRealBusy,

Try allocating the three buffers at the same time and run the tests without deallocating any of them. Make sure they are aligned the same as well.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Ghandi

I'm definitely no Oxford scholar, but I agree with oex that 10 x 5,000,000,000 != 6,000,000,000. Seeing as we're using decimal (base 10) and working with whole numbers, all we need to do to multiply by ten is append another 0 to the number. This can be explained in a more technical or accurate fashion by any of the math guys here; I'm only explaining it as a layman.

5,000,000,000 x 10 = 50,000,000,000

HR,
Ghandi

jj2007

Quote from: Ghandi on May 05, 2010, 10:59:57 AM
5,000,000,000 x 10 = 50,000,000,000

Take the CrLfs on board: 5,000,000,000 x (10+2) = 60,000,000,000

KeepingRealBusy

JJ,

Thank you for responding, I've been busy trying to re-write the algo. Your answer is correct, the strings are text file strings with CRLF's, so 600,000,000 bytes.

Dave.

excisetax

Hi,

Is it 60,000,000,000 or 600,000,000 bytes?

How much RAM does the computer have?


Slugsnack

Quote from: excisetax on June 07, 2010, 08:37:15 PM
Hi,

Is it 60,000,000,000 or 600,000,000 bytes?

How much RAM does the computer have?


doesn't matter. that's what the virtual memory abstraction is for.

hutch--

Physical memory does matter in terms of timings. Virtual memory uses disk as the extra space, but at the cost of disk I/O speed when allocated blocks are shifted in and out to disk. You need to keep an eye on the total memory used in relation to how much is installed on the machine; you will pay a substantial penalty in speed if you allocate more than the machine has available.

Were I faced with working on files of 2 GB or larger, I would be inclined to set a smaller buffer size and read them in pieces: if the machine has 2 GB available, I would try for a 1 GB memory allocation and read the file into that buffer a gigabyte at a time. With files this large, Win32 does not support allocations bigger than 2 GB (or a little more if you use special linking), so you are stuck with breaking the file up. As long as you are using NTFS you can chomp your way through files far bigger than the 2 GB effective limit.

KeepingRealBusy

All,

I am on about the seventh or eighth iteration of this little project. If anyone is interested in the history file of this saga (a .txt file), I can post it as a zip. I've run into all sorts of interesting things that work, and things that do not work. The file contains all sorts of timing data and code snippets explaining what was being changed at each point.

The best I've been able to do is a time of 17:22.20 with the above stated file, and this started with 47:31.80 for Windows SORT.EXE.

Dave.

Farabi

So is there no other way to read a file above 2 GB except to split it into 2 GB chunks?
Or does VirtualAlloc handle it? If VirtualAlloc did, how would you access the memory above 4 GB? Would it use a 64-bit variable?
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

KeepingRealBusy

I am not sure about the use of memory mapped files and whether they can access huge files, but there is something else I just ran into: Address Windowing Extensions (AWE) for VirtualAlloc, which appears to allow you to allocate memory that is NOT paged (this might allow huge memory accesses for servers, maybe). That would certainly speed up a Quicksort with indirect pointers (one of the problems I ran into). My feelings about "general" solutions to problems sort of (all puns intended) match my feelings about Windows SORT.EXE: with general solutions you get general performance. If you want fast performance, code it yourself. Memory mapped files have come up, I think in the Laboratory, and one of the drawbacks is that you must touch each page to get that page read in. If you are only accessing the file sparsely, that would be good enough, but to sort a whole file you need to read ALL of the data, and reading a page only when you first touch it, instead of reading 64KB at a time, would surely slow you down.

Dave.

Tedd

You can access up to 64-bit offsets with memory mapped files, which is considerably more than 2 GB (closer to 16.8 million terabytes).
The limitation is that you can't access it all at once (since it won't fit into memory anyway), so you have a 'window' that you can shift around to access anything within it. The size of that window is limited by available memory, with a maximum of 2 GB for general access; not that you should ever allocate that much unless you plan on killing the performance of every single process.
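That sliding-window idea can be sketched with Python's mmap module (an analogue for illustration, not the Win32 MapViewOfFile code; `read_window` and its parameters are made-up names). The map offset must be a multiple of the allocation granularity (64 KB on Windows), so the window is rounded down and the difference skipped:

```python
import mmap

def read_window(path, offset, length):
    """Map only a window of a large file instead of the whole file."""
    gran = mmap.ALLOCATIONGRANULARITY
    base = (offset // gran) * gran      # legal map offset at or below `offset`
    delta = offset - base               # bytes to skip inside the window
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=base) as m:
            return m[delta:delta + length]
```

Shifting the window is just another call with a different offset; only the mapped portion needs to fit in the address space.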
No snowflake in an avalanche feels responsible.

redskull

Something to keep in mind is that when you VirtualAlloc() address space, the pages are "demand-zero"; essentially, the pages need to be zeroed out, byte-by-byte, before the system can give them to you.  When you commit most of the physical memory, then deallocate and reallocate it, your entire stash of pre-zeroed pages is most likely depleted, and the system takes a break to clear pages out again for the next use.  Using memory mapped files avoids this bottleneck, because the system "knows" the data will be overwritten by the file data before you get access to it.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

jj2007

Quote from: redskull on June 14, 2010, 11:48:19 AM
Something to keep in mind is that when you VirtualAlloc() address space, the pages are "demand-zero"; essentially, the pages need to be zeroed out, byte-by-byte, before the system can give them to you.

Interestingly enough, if you use HeapAlloc instead of VirtualAlloc, there is a precise point at which your memory comes back zeroed:
include \masm32\include\masm32rt.inc

.code
start:
    invoke GetProcessHeap
    push eax
    bytes = 126*4096                    ; 126 pages requested - try 127 to see zeroed memory (Win XP)
    invoke HeapAlloc, eax, 0, bytes     ; about 516 KB
    mov ebx, eax
    mov esi, eax
    mov ecx, bytes/4
    .Repeat
        lodsd
        dec ecx
    .Until eax || ecx<=0
    MsgBox 0, str$(ecx), "Loops left:", MB_OK
    pop eax                             ; the heap
    invoke HeapFree, eax, 0, ebx
    exit
end start


Plain HeapAlloc does not zero memory. However, if you request more than 126 pages, i.e. more than about 520,000 bytes, Windows switches to VirtualAlloc and the memory comes back zeroed.