
zero fill mem

Started by ragdog, February 19, 2007, 08:45:07 PM


asmfan

Quote from: thomas_remkus on June 01, 2007, 12:52:53 PM
If you "dec" until you get to 0, you do not need to use the "cmp", because "dec" sets the zero flag and you can "jnz" back to your label without the additional instruction.
Of course you're right, but I'm talking about the best memory access, not the best loop organization ;) which is what you are talking about.
For example, it's better to do cld before rep stos* than std, for performance reasons. The same applies when organizing your own memory access routine.

Below is my app for testing RAM fill speed. It is written in fasm, requires fasm to recompile, and needs Windows and SSE2 to run (it could be done with plain SSE using a different packed type).

[attachment deleted by admin]
Russia is a weird place

Subhadeep.Ghosh

Hello,

I found this to be an interesting discussion, so I decided to contribute to it as well with what little I've got.

In C/C++ I've been using my own versions of memcpy, memset and the like. I think I use the same optimized algorithm as the one the C/C++ standard library uses, but I still get a kick out of using my own libraries whenever I can.

The functions memcpy, memset and ZeroMemory can be grouped under the same category, since each of them modifies a block of memory. Let's consider a computer which can handle 32 bits (4 bytes) of data in a single clock cycle. There are then two possibilities: either the size of the block of memory (in bytes) is divisible by 4, or it is not.

If the size of the block of memory is divisible by 4, we can set ecx to (size of the memory block) >> 2 and do a REP STOSD. If the size is not divisible by 4, we first set ecx to (size of the memory block) >> 2 and do a REP STOSD, then set ecx to (size of the memory block) % 4 and do a REP STOSB to fill the remaining 1 to 3 bytes.
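To make the two steps concrete, here is a rough C sketch of the same idea. It is only an illustration, not the code from my library: the name zero_fill32 is made up for this post, and the two loops stand in for the REP STOSD and REP STOSB parts. I'm assuming a plain zero fill and using memcpy for the 4-byte store so that an unaligned destination stays safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name invented for this example): zero `size` bytes
   at `dst`, 4 bytes at a time, then finish the 0 to 3 leftover bytes. */
void zero_fill32(void *dst, size_t size)
{
    unsigned char *p = dst;
    size_t dwords = size >> 2;  /* count for the REP STOSD part */
    size_t tail   = size & 3;   /* same as size % 4, count for the REP STOSB part */
    const uint32_t zero = 0;

    while (dwords--) {
        memcpy(p, &zero, sizeof zero);  /* one 4-byte store, alignment-safe */
        p += 4;
    }
    while (tail--) {
        *p++ = 0;                       /* remaining bytes, one at a time */
    }
}
```

When the size is divisible by 4 the tail count is 0 and the second loop does nothing, so one routine covers both cases.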

In my opinion, this is the best approach to 32-bit memory manipulation. The same algorithm could be extended to 64 bits and 128 bits as well.

Regards,
Subhadeep Ghosh