Hi!
Which loop implementation do you consider to be faster on the P4? (I have no time to try it myself.)
.xxx
...
sub eax, 1
jnz .xxx
or (I would prefer this version, with the counter in ecx instead of eax)
.repeat
...
.untilcxz
Thanks for your input!
Have a nice evening, Hannes
Does .untilcxz use jecxz? If it does, it's better to use the former.
.UNTILCXZ uses the LOOP instruction. On a P3, SUB/JNZ is almost 3x faster than .REPEAT/.UNTILCXZ.
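For reference, here is a minimal sketch of what that means in practice. The procedure name SumDown and the loop body are made up for illustration. MASM turns the .REPEAT/.UNTILCXZ pair into a plain LOOP back to the top of the block, so the two loops below should assemble to the same instructions.

```asm
.386
.model flat, stdcall
.code

; Hypothetical example: sums 4+3+2+1 into eax, once with each loop form.
SumDown proc
    xor eax, eax            ; running total
    mov ecx, 4              ; loop counter
    .repeat
        add eax, ecx
    .untilcxz               ; emitted as: loop <top of the .repeat block>

    mov ecx, 4              ; counter reached 0 above, so reload it
top:
    add eax, ecx
    loop top                ; hand-written equivalent of the macro form
    ret                     ; eax = 20 (10 from each loop)
SumDown endp
end
```

So whatever LOOP costs on a given CPU, the macro form costs the same, since it is the same instruction underneath.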
I recommend using the DEC and JNZ instructions.
Example:
mov ecx, <nCount>
myLoop:
...
dec ecx
jnz myLoop
Try not to use this:
mov ecx, <nCount>
myLoop:
...
loop myLoop
because the LOOP instruction takes more CPU time than DEC + JNZ. DEC + JNZ are paired, i.e. these 2 instructions take 1 CPU clock (LOOP takes 3).
Thanks for your replies! :U
Actually I thought the MASM macro .repeat/.untilcxz would expand to dec/jnz, but I was wrong!
@Snouphruh: according to Intel's manuals and Agner Fog's PDF, inc/dec is not optimal for the P4; that's the reason I substitute it with sub/add (as suggested by these docs). When I have the time, I will test whether this replacement is really any good for performance!
Cheers, Hannes
OK.
When you have the results, please write back to me, OK?
PS: I have an AMD Athlon XP 2500+ @ 2244 MHz.
LOOP is a complex instruction. The Intel optimisation manual for the P4 recommends not using it.
And slightly off topic, but don't forget the neat little trick for moving up through an array from offset 0 to the end, e.g.:
mov ecx,-256*4
lp: mov eax,[Array+256*4+ecx] ; or something
add ecx,4
js lp
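To spell the trick out as a complete sketch (the array contents and the name SumArray are assumptions for illustration only): ecx starts at minus the array size in bytes, the base address is biased by the same size, and the ADD that advances the index also sets the sign flag, so no separate CMP is needed.

```asm
.386
.model flat, stdcall

.data
Array dd 256 dup(1)         ; hypothetical test data: 256 dwords, all 1

.code
; Hypothetical example: sums the 256 dwords of Array into eax.
SumArray proc
    xor eax, eax                 ; running sum
    mov ecx, -256*4              ; negative byte offset, counts up towards 0
lp:
    add eax, [Array+256*4+ecx]   ; first pass reads Array+0, last reads Array+255*4
    add ecx, 4                   ; next dword; sets SF while ecx is still negative
    js lp                        ; falls through when ecx reaches 0
    ret                          ; eax = 256 for the test data above
SumArray endp
end
```

The loop counter doubles as the index, which saves one register and one instruction per iteration compared with a separate counter plus pointer.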
OK, here are the results:
Using a small loop (stdcall) without memory accesses, like
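looptest0 proc
    mov ecx, 10000      ; iteration count
L1:
    mov ebx, 1          ; dummy work; note that ebx is overwritten without being saved
    add ebx, ecx
    sub ecx, 1          ; the loop control under test: sub/jnz
    jnz L1
    ret
looptest0 endp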
yields the following results on a P4:
using sub/jnz (code above): ~30 000 cycles
using loop: 34 400 cycles
using dec/jnz: 42 250 cycles
using .repeat/.untilcxz: 43 800 cycles
All numbers are averaged over 4 runs (done 3 times to check).
Thanks for reading, Hannes
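The post does not say how the cycles were counted. One common way, and purely an assumption here, is to read the time-stamp counter with RDTSC around the call, with CPUID as a serializing instruction before each read. TimeLoop is a hypothetical name; looptest0 is repeated from the post above so the sketch assembles on its own.

```asm
.586                        ; needed for cpuid and rdtsc
.model flat, stdcall
.code

looptest0 proc              ; loop under test, as posted above
    mov ecx, 10000
L1:
    mov ebx, 1
    add ebx, ecx
    sub ecx, 1
    jnz L1
    ret
looptest0 endp

; Hypothetical harness: returns the elapsed cycle count in eax.
TimeLoop proc uses ebx esi
    xor eax, eax
    cpuid                   ; serialize (clobbers eax, ebx, ecx, edx)
    rdtsc                   ; edx:eax = time-stamp counter
    mov esi, eax            ; keep the low 32 bits of the start value
    call looptest0          ; clobbers ebx and ecx, esi survives
    xor eax, eax
    cpuid                   ; serialize again before the second read
    rdtsc
    sub eax, esi            ; elapsed cycles, low 32 bits only
    ret
TimeLoop endp
end
```

The result includes the overhead of CALL/RET and the second CPUID, so timing an empty procedure the same way and subtracting that figure gives a fairer number. Only the low 32 bits are compared, which is fine for runs this short.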