Having used these macros, am I correct in assuming that their word and dword equivalents would be faster than their string counterparts, or would the inc or add instructions make them slower?
The string instructions can be faster under some circumstances, as long as there is the right amount of data being handled. There's a cut-off where it becomes faster to do it the 'long' way (or is it the other way around :lol)
There is another thread about this... somewhere.
Tedd,
Are you saying that mov [edi],eax
add edi,4
is faster than stosd?
Tedd,
I've searched for this other thread you mentioned but can't find it. Maybe I'm putting in the wrong search parameters.
Neil,
There are a couple of special cases with the string instructions, REP MOVSD and REP STOSD. The special-case handling only works with the REP prefix; used on their own, the string instructions are very slow and should be avoided. The REP string instructions do outperform the normal integer instructions in some contexts: if source and destination are not both in cache at the same time they are faster, as they appear to handle non-temporal writes in much the same way as the specialised SSE instructions, where the normal integer instructions don't have that option.
Most of us grew up with the string instructions, but from the early PIIs onwards they have not been competitive in most instances. Loading a register and incrementing the index is almost always faster, sometimes by a large amount.
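For reference, here is a minimal sketch of what dword versions of the macros could look like, assuming they follow the same pattern as the mov/add pair quoted above. The macro names lodd_alt and stod_alt and the two buffers are made up for this example and are not part of MASM32. Both macros assume the direction flag is clear, and unlike lodsd/stosd they alter the flags because of the add.

```masm
include \masm32\include\masm32rt.inc

; Hypothetical names, not part of the MASM32 macro set.
lodd_alt MACRO              ; stand-in for lodsd (DF = 0 assumed)
    mov eax, [esi]          ; load the dword at ESI
    add esi, 4              ; step the source pointer, alters flags
ENDM

stod_alt MACRO              ; stand-in for stosd (DF = 0 assumed)
    mov [edi], eax          ; store EAX at EDI
    add edi, 4              ; step the destination pointer, alters flags
ENDM

.data
srcbuf dd 10, 20
dstbuf dd 2 dup(0)

.code
start:
    mov esi, OFFSET srcbuf
    mov edi, OFFSET dstbuf
    REPEAT 2                ; copy two dwords, one macro pair each
      lodd_alt
      stod_alt
    ENDM
    mov eax, dstbuf+4       ; fetch the second copied dword
    print ustr$(eax),13,10
    exit
end start
```

Whether this beats the string instructions on a given CPU is something to measure rather than assume.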
I couldn't think of a good real-world test for the macros, but on my P3 there appears to be no contest.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buff db 16 dup(0)   ; each test touches 10 bytes, so 8 was too small
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff
REPEAT 10
lodsb
ENDM
counter_end
print ustr$(eax), " cycles, lodsb",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff
REPEAT 10
lob
ENDM
counter_end
print ustr$(eax), " cycles, lob",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi, OFFSET buff
REPEAT 10
stosb
ENDM
counter_end
print ustr$(eax), " cycles, stosb",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi, OFFSET buff
REPEAT 10
stb
ENDM
counter_end
print ustr$(eax), " cycles, stb",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
17 cycles, lodsb
5 cycles, lob
16 cycles, stosb
8 cycles, stb
And this is more or less of a worst-case comparison:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buff1 dd 1000 dup(0)
buff2 dd 1000 dup(0)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff1
mov edi, OFFSET buff2
mov ecx, 1000
rep movsd
counter_end
print ustr$(eax), " cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff1
mov edi, OFFSET buff2
mov ecx, 1000
@@:
mov eax, [esi]
mov [edi], eax
dec ecx
jnz @B
counter_end
print ustr$(eax), " cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff1
mov edi, OFFSET buff2
mov ecx, 1000
rep movsd
counter_end
print ustr$(eax), " cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov esi, OFFSET buff1
mov edi, OFFSET buff2
mov ecx, 1000
@@:
mov eax, [esi]
mov [edi], eax
dec ecx
jnz @B
counter_end
print ustr$(eax), " cycles",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
926 cycles
3005 cycles
924 cycles
3005 cycles
Thanks Hutch & Michael,
Michael, your demo code gives a great illustration of the speed difference: the lob macro is more than 3 times as fast as lodsb. It also demonstrates how the use of rep speeds up the string instructions. Now I have one more question regarding writing macros to replace stosw, stosd etc. Are 2 incs or 4 incs faster or slower than adding the appropriate offset? There must be a point where n incs become slower than add n.
I think add would probably be faster than two or more incs.
Well, add it is then :U
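One way to settle the inc versus add question on a particular machine is to time the variants with the same counter macros used earlier in the thread. The sketch below is only that, a sketch: the macro names stow_inc, stow_add and stow_lea and the buffer wbuff are hypothetical. Note that the three are not quite interchangeable, since inc leaves the carry flag alone, add rewrites all the arithmetic flags, and lea (like stosw itself) touches none of them.

```masm
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

; Hypothetical word-store macros, named for this sketch only.
stow_inc MACRO
    mov [edi], ax
    inc edi                 ; inc leaves CF alone, sets the other flags
    inc edi
ENDM

stow_add MACRO
    mov [edi], ax
    add edi, 2              ; add sets all the arithmetic flags
ENDM

stow_lea MACRO
    mov [edi], ax
    lea edi, [edi+2]        ; lea changes no flags
ENDM

.data
wbuff dw 16 dup(0)          ; room for the 10 word stores per test

.code
start:
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET wbuff
      REPEAT 10
        stow_inc
      ENDM
    counter_end
    print ustr$(eax), " cycles, two incs",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET wbuff
      REPEAT 10
        stow_add
      ENDM
    counter_end
    print ustr$(eax), " cycles, add 2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET wbuff
      REPEAT 10
        stow_lea
      ENDM
    counter_end
    print ustr$(eax), " cycles, lea",13,10,13,10

    inkey "Press any key to exit..."
    exit
end start
```

Results will differ from one processor to the next, so it is worth running on every CPU you care about.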
Mind you, I have a few instances where I use std; this is starting to get complicated :toothy
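For the std case, stosd steps EDI downwards after each store, so a replacement macro has to subtract instead of add. A minimal sketch, assuming a dword store; stod_back and fillbuf are hypothetical names for this example. A side benefit of the macro form is that the direction flag never needs to be set, so there is no cld to forget afterwards.

```masm
include \masm32\include\masm32rt.inc

; Hypothetical name: stand-in for stosd with the direction flag set.
stod_back MACRO
    mov [edi], eax          ; store EAX at EDI
    sub edi, 4              ; step downwards, as stosd does after std
ENDM

.data
fillbuf dd 4 dup(0)

.code
start:
    mov eax, 7
    mov edi, OFFSET fillbuf + 12    ; start at the last dword
    REPEAT 4                        ; fill from the top down
      stod_back
    ENDM
    ; EDI now points 4 bytes below fillbuf
    mov eax, fillbuf                ; the first dword was written last
    print ustr$(eax),13,10
    exit
end start
```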
Neil,
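I think it all depends on what you are doing:
mov ecx,count
dec ecx                 ; index of the last dword (assumes count >= 1)
mov esi,OFFSET src      ; 'str' is an instruction mnemonic, so use another name
mov edi,OFFSET buf
@@:
mov eax,[esi+ecx*4]
mov [edi+ecx*4],eax
dec ecx
jns @B                  ; loop until the index goes negative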
With indexed addressing like this, you only need a single increment or decrement per pass instead of updating both pointers.
Dave
Dave,
That's an interesting code snippet, but (correct me if I'm wrong) according to Hutch & Michael rep movsd would be much quicker.