News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Unusual result with two memory copy algos.

Started by hutch--, December 30, 2006, 11:27:23 PM

Previous topic - Next topic

hutch--

I have been playing with a couple of memory copy algos, a standard REP MOVSD version and an unrolled register version and have found an interesting effect depending on the distance in address terms between one buffer and another. With 2 16k buffers the unrolled register version is faster in block copy up to a tested 8192 bytes but when the buffer size increases to 1 meg or over, the built in REP MOVSD version is noticably faster on the two PIVs I work on.

The only explanation is one of page thrashing as the size of the two buffers ensures that both cannot be in cache at the same time. The interesting part is that REP MOVSD is far less effected than the unrolled register version.

Here are the results on the later PIV.


16k buffer
----------
297 32 byte mem copy
125 32 byte reg copy
296 32 byte mem copy
125 32 byte reg copy
297 32 byte mem copy
125 32 byte reg copy
297 32 byte mem copy
125 32 byte reg copy
235 128 byte mem copy
125 128 byte reg copy
250 128 byte mem copy
125 128 byte reg copy
234 128 byte mem copy
141 128 byte reg copy
234 128 byte mem copy
141 128 byte reg copy
437 512 byte mem copy
250 512 byte reg copy
438 512 byte mem copy
234 512 byte reg copy
453 512 byte mem copy
234 512 byte reg copy
438 512 byte mem copy
250 512 byte reg copy
516 2048 byte mem copy
421 2048 byte reg copy
516 2048 byte mem copy
422 2048 byte reg copy
531 2048 byte mem copy
406 2048 byte reg copy
516 2048 byte mem copy
422 2048 byte reg copy
906 8192 byte mem copy
828 8192 byte reg copy
922 8192 byte mem copy
828 8192 byte reg copy
907 8192 byte mem copy
828 8192 byte reg copy
906 8192 byte mem copy
828 8192 byte reg copy
Press any key to continue ...

8 meg buffer
------------

297 32 byte mem copy
141 32 byte reg copy
281 32 byte mem copy
141 32 byte reg copy
297 32 byte mem copy
125 32 byte reg copy
296 32 byte mem copy
125 32 byte reg copy
250 128 byte mem copy
141 128 byte reg copy
234 128 byte mem copy
141 128 byte reg copy
234 128 byte mem copy
125 128 byte reg copy
250 128 byte mem copy
125 128 byte reg copy
297 512 byte mem copy
250 512 byte reg copy
282 512 byte mem copy
265 512 byte reg copy
328 512 byte mem copy
250 512 byte reg copy
282 512 byte mem copy
250 512 byte reg copy
296 2048 byte mem copy
407 2048 byte reg copy
297 2048 byte mem copy
406 2048 byte reg copy
297 2048 byte mem copy
406 2048 byte reg copy
297 2048 byte mem copy
406 2048 byte reg copy
500 8192 byte mem copy
813 8192 byte reg copy
484 8192 byte mem copy
812 8192 byte reg copy
500 8192 byte mem copy
813 8192 byte reg copy
484 8192 byte mem copy
813 8192 byte reg copy
Press any key to continue ...

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

j_groothu

Hi Hutch--,
         I have been reading these forums for some time, although i am a first time poster.  In addition to the cache and paging explanations you have... A pattern in those timing results looked familiar ( one algo being less effected by changes in the type of problem) from my comp sci days, so I looked it up and it MIGHT kindof partially explain what's going on. 

The "No Free Lunch Theorem" ( seriously...tehehe ) ... the way I understand it, says that A super specialised / customised / optimised algorithm ( like the reg copy version ) will excel in Its type of problem -- but when taken outside its problem domain will perform slightly less well than the general purpose ( REP MOVSD ) version ( which will be of more even performance over a wide range of use).  The description i found in a wiki http://en.wikipedia.org/wiki/No-free-lunch_theorem has a graph of this and goes on to say :
"For practising engineers and other optimization professionals, the theorem justifies the view that as much prior domain knowledge should be utilized as possible, and custom optimization routines constructed for particular domains. It may be thought of as a full employment theorem for optimization professionals."

So I guess It means now go and build an algo optimised for those far addresses....

Regards, Jason

hutch--

Hi Jason,

Welcome on board. It is an interesting result where the REP MOVSD instruction pair have special circuitry which seems to give it an advantage in addressing memory that is further apart than can be held in cache. What will be interesting when I get the time to play with it is using an SSE/2 instruction set where you can seperate read and write so that write does not go back through the cache.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

j_groothu

Indeed,  I'll be eagerly watching to see if if those sse methods perhaps speedup for those bigger buffers, could be  be a handy tool if some penalties are avoided. [ EDIT: it does indeed make sense to me that the reg copy would be aided by the cache for those close buffers, but hindered by it for the far ones, maybe the sse2 version can provide a happy medium for close and far ?, not too familiar with the sse instructions myself, may just do some more reading on them].

Happy New Year!  :bdg

Jason