Hi,
I have a question about zeroing memory: which function is better for filling a buffer with zeros,
RtlZeroMemory or ZeroMem?
Thanks in advance.
There is a thread here, http://www.masm32.com/board/index.php?topic=6576.0
The thread discusses which one is the fastest.
I believe you would want to use RtlZeroMemory.
8 out of 10 jag is right...
I'm no expert, but where I can replace an API, I will. I think I found RtlZeroMemory hard to replace once, for whatever reason, and only bitRAKE's code did the job for me at the time when all the others I tried failed. I could have been doing something wrong, but whatever the case, the code below solved that problem. If this is what you mean by zeroing out the buffer, both will wipe *the entire buffer space*, whether there are chars in it or not.
I changed Qages' code a long time ago because I wanted eax to always be free. I use it 95% of the time and have never had a problem.
Qages once said "nothing is faster than a JUMP"
; ############################
Qages_Clear_Buff PROC USES ebx, len:DWORD, scr:DWORD
    mov edx, scr                ; edx -> buffer (Qages cleanbuff)
    xor ebx, ebx                ; ebx = byte index, eax stays free
    cmp ebx, len
    je  done                    ; nothing to clear when len = 0
@@:
    mov BYTE PTR [edx+ebx], 0
    inc ebx
    cmp ebx, len                ; stop after exactly len bytes
    jne @B
done:
    ret
Qages_Clear_Buff ENDP
; ############################
I once stumbled on some weird thing while coding, something like WriteProcessMemory without even calling the API, and it worked. I was playing with RtlZeroMemory and these other codes, and ONLY bitRAKE's code did the job at the time. I forgot how I did it and lost track of what I was doing and why.
Anyway, I use this when I mean business, meaning RIGHT NOW with no tricks allowed, and it NEVER FAILS...
; ....................... bitRAKE clean buffer
xor eax,eax
mov edi, offset hFile
mov ecx, SIZEOF hFile
rep stosb
The recent thread on zeroing memory made most of these questions clear. If it's a large block to zero, use REP STOSD; if it's under about 700 bytes, use something like memfill in the masm32 library. An MMX version will be slightly faster on Intel hardware but slightly slower on AMD.
Any memory filler is probably faster using stos*, like...
zeromemory proc uses eax ecx edi, memoryarea:dword, memorysize:dword
local bytesremaining:dword
xor eax, eax                ; fill value
mov edi, memoryarea
mov ecx, memorysize
mov bytesremaining, ecx
cmp ecx, 0
je finished
shr ecx, 2                  ; divide by 4: number of whole dwords
cmp ecx, 0
je stoswmode
and bytesremaining, 3       ; 0..3 bytes are left after the dwords
rep stosd
stoswmode:
mov ecx, bytesremaining
cmp ecx, 0
je stosbmode
shr ecx, 1                  ; divide by 2: number of whole words
cmp ecx, 0
je stosbmode
and bytesremaining, 1       ; 0..1 byte is left after the words
rep stosw
stosbmode:
mov ecx, bytesremaining
cmp ecx, 0
je finished
rep stosb
finished:
ret
zeromemory endp
I think Microsoft does something similar. I haven't tested that code either, but it should work.
evlncrn8,
You should benchmark it against the collection of algos in the thread mentioned above. Over about 700 bytes, REP STOSD leaves the rest behind.
Haven't really got time atm :( Does 'leaves the rest behind' mean it's slower?
Perhaps if you read the thread you would understand what "leaves the rest behind" meant.
And you could have explained instead of going 'look at the thread'; it's a simple question.
As for the thread: different results, different PCs, and most likely different types of memory tested (stack, fixed, aligned, not aligned). They are all just results that mean nothing without a relative base to work from, i.e. not conclusive.
Everybody has a theory, feel free to put yours to the test. That is what objective testing is about. If you think you have a more accurate benchmarking method, feel free to demonstrate it.
:U thanks to all for the information
greets
ragdog
Quote
I changed Qages' code a long time ago because I wanted eax to always be free. I use it 95% of the time and have never had a problem.
Qages once said "nothing is faster than a JUMP"
I didn't know i had a fan.
Quote
I didn't know i had a fan.
I did not notice this until now. I went on to doing something else, I guess.
Yes, you did... and still do, because I still read old threads where you once posted. I haven't seen anything new under Qages lately until now...
I never forget code presented by other coders that I have used, especially when the coder makes strong comments on why it works so well... Also, it was people like you who kept me interested in ASM in the first place, even though I was slow about it.
Anyway, this question relates to all functions like yours, but I'll use this example:
If I have only 40 bytes in a 256-byte buffer and I use sizeof... am I assured that it will clear only up to the first zero it encounters, or does it clear to the very end of the buffer?
That is, 256 repeats instead of 41 repeats using this code. I think it would return after hitting the 41st byte, but I need to be 100% sure. This is why I'm asking.
TEMP_256 db 256 dup(?)
Only 40 bytes are used:
xor eax,eax
mov edi, offset TEMP_256
mov ecx, sizeof TEMP_256
rep stosb
And if I want it to clear the entire buffer no matter what, this would be one way it can be done... With only 40 bytes inside the buffer, am I assured that it will clear to the bitter end of the buffer, stepping through all 256 bytes? Or will it REALLY still return after hitting the 41st byte of the buffer contents, returning well before the expected 256th?
TEMP_256 db 256 dup(?)
Only 40 bytes are used:
xor eax,eax
mov edi, offset TEMP_256
mov ecx, 256
rep stosb
Thanks in advance
I just need to know for sure... and understand what SIZEOF actually does: the size of the bytes in use, or the size of the whole buffer.
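A minimal sketch that may help with the question above, assuming the MASM32 SDK is installed in its default \masm32 folder (the include paths are that assumption). SIZEOF is evaluated by the assembler from the data declaration, so SIZEOF TEMP_256 is the constant 256 no matter what the buffer holds at run time. REP STOSB never looks at the buffer contents; it simply stores AL until ECX reaches zero.

```asm
; Assumption: MASM32 SDK in its default location
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data?
TEMP_256 db 256 dup(?)              ; same buffer as in the question

.code
start:
    ; Clear the whole buffer: ECX = 256, so 256 bytes are written
    cld                             ; make STOSB walk upward
    xor eax, eax                    ; AL = 0 is the fill value
    mov edi, offset TEMP_256
    mov ecx, sizeof TEMP_256        ; assemble-time constant 256
    rep stosb                       ; stores AL exactly ECX times

    ; Clear only the used part: 40 bytes plus the terminator
    xor eax, eax
    mov edi, offset TEMP_256
    mov ecx, 41                     ; caller must supply the count
    rep stosb

    invoke ExitProcess, 0
end start
```

Under these assumptions, both snippets in the question would behave the same way and write all 256 bytes. To stop early, the count in ECX has to be the used length, since the instruction itself has no notion of a terminator.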
There are really lots of ways to do the job: using the cache (mov, movs*, movq, movdqa) or bypassing it (movnti, movntps/movntpd, movntdq), depending on the task. Non-cached access on large amounts of memory is much faster (approx. 3x in my tests). One piece of advice: ascending access (from lower addresses to higher ones) is faster.
If you start your loop with the counter at the highest value and "dec" until you get to 0, you do not need the "cmp", because "dec" sets the zero flag and you can "jnz" back to your label without the additional instruction.
If this is wrong, please let me know. I have a current project with a loop that some experts are helping me through right now. That project does not use this technique, because I can't seem to get so much else working that I don't want to mess with this too. But I think this is correct for performance.
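The count-down idea described above could be sketched roughly like this. The procedure name ClearDown is hypothetical, and the code is only an illustration of dropping the "cmp", not a tuned routine. Note that it walks from the highest address down to the lowest.

```asm
.386
.model flat, stdcall
option casemap:none

.code
; ClearDown is a hypothetical name used for illustration only
ClearDown PROC USES edi, buf:DWORD, len:DWORD
    mov edi, buf                    ; edi -> start of the buffer
    mov ecx, len                    ; ecx counts down from len to 0
    test ecx, ecx
    jz  done                        ; nothing to do for len = 0
@@:
    mov BYTE PTR [edi+ecx-1], 0     ; clears offsets len-1 down to 0
    dec ecx                         ; DEC sets ZF when ecx reaches 0
    jnz @B                          ; so no separate CMP is needed
done:
    ret
ClearDown ENDP
end
```

It would be called as, for example, invoke ClearDown, addr TEMP_256, sizeof TEMP_256, given a suitable PROTO.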
Quote from: thomas_remkus on June 01, 2007, 12:52:53 PM
"dec" until you get to 0 you do not need to use the "cmp" because "dec" will fill in the 0 flag and you can "jnz" back to your label without the additional instruction.
Of course you're right, but I'm talking about the best memory access, not the best loop organization ;) which is what you are talking about.
For example, it's better for performance to do cld before rep stos* than std. The same applies when organizing your own memory access routine.
Below is my app for testing RAM fill speed, written in fasm. It requires fasm to recompile, and Windows and SSE2 to run (it can be done with plain SSE using a different packed type).
[attachment deleted by admin]
Hello,
I found this to be an interesting discussion, so I decided to contribute to it as well with whatever little I've got :bg.
In C/C++ I've been using my own versions of memcpy, memset and the like. I think I use the same optimized algorithm as the one the C/C++ standard library uses, but I still get a kick out of using my own libraries whenever I can.
The functions memcpy, memset and ZeroMemory can be grouped under the same category, in that they all modify a block of memory. Let's consider a computer which can handle 32 bits (4 bytes) of data in a single clock cycle. In that situation there are two possibilities: either the size of the block of memory (in bytes) is divisible by 4 or it is not.
If the size of the block is divisible by 4, we can set ecx to (size of the memory block) >> 2 and do a REP STOSD. If the size is not divisible by 4, we first set ecx to (size of the memory block) >> 2 and do a REP STOSD, then set ecx to (size of the memory block) % 4 and do a REP STOSB.
In my opinion, this is the best 32-bit memory manipulation. The same algorithm can be extended to 64 bits and 128 bits as well.
Regards,
Subhadeep Ghosh
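The two-step fill described in the post above (dwords first, then the leftover bytes) might look like this in MASM. This is a sketch only: the name ZeroBlock is hypothetical, alignment is ignored, and nothing here has been benchmarked.

```asm
.386
.model flat, stdcall
option casemap:none

.code
; ZeroBlock is a hypothetical name used for illustration only
ZeroBlock PROC USES edi, buf:DWORD, len:DWORD
    cld                             ; make STOS walk upward
    xor eax, eax                    ; fill value 0
    mov edi, buf
    mov ecx, len
    shr ecx, 2                      ; ecx = len >> 2, whole dwords
    rep stosd                       ; writes nothing when ecx = 0
    mov ecx, len
    and ecx, 3                      ; ecx = len % 4, leftover bytes
    rep stosb
    ret
ZeroBlock ENDP
end
```

Because REP with a zero count stores nothing, the same path handles both the divisible-by-4 case and the remainder case without extra branches.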