loop speed question

Started by a_h, September 17, 2005, 05:55:40 PM

a_h

Hi!

Which loop implementation do you consider faster on the P4? (I have no time to try it myself.)


.xxx
...
sub eax, 1
jnz .xxx


or (I would prefer that version, counter in ecx instead of eax)


.repeat
...
.untilcxz


Thanks for your input!

Have a nice evening, Hannes

roticv

Does .untilcxz use jecxz? If it does, the former is the better choice.

MichaelW

.UNTILCXZ uses the LOOP instruction. On a P3, SUB/JNZ is almost 3x faster than .REPEAT/.UNTILCXZ.
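
To make the difference concrete, here is a rough sketch of the two forms, assuming MASM 6 or later with its high-level directives. The count of 100 and the generated label name are placeholders for illustration, not taken from the thread.

```asm
; Form 1: .REPEAT/.UNTILCXZ, which MASM assembles with LOOP
    mov ecx, 100            ; arbitrary iteration count
.REPEAT
    ; loop body
.UNTILCXZ
; roughly equivalent to:
;   @C0001:                 ; label name is illustrative
;       ; loop body
;       loop @C0001         ; decrement ecx, jump back while ecx != 0

; Form 2: keep the structured syntax but end the loop on the zero flag
    mov ecx, 100
.REPEAT
    ; loop body
    sub ecx, 1              ; sets ZF when ecx reaches 0
.UNTIL ZERO?                ; assembles to a conditional jump back (jnz)
```

The second form keeps the .REPEAT block structure while producing the SUB/JNZ pair discussed here.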



Snouphruh

I recommend using the DEC and JNZ instructions.
Example:

                    mov ecx, <nCount>
myLoop:
                    ...
                    dec ecx
                    jnz myLoop


Try not to use this:

                    mov ecx, <nCount>
myLoop:
                    ...
                    loop myLoop

because the LOOP instruction takes more CPU time than DEC + JNZ. DEC and JNZ pair, i.e. the two instructions together take 1 CPU clock (LOOP takes 3).

a_h

Thanks for your replies! :U

Actually, I thought the MASM macro .repeat/.untilcxz would expand to dec/jnz, but I was wrong!

@Snouphruh: according to Intel's manuals and Agner Fog's PDF, inc/dec is not optimal on the P4, which is why I replace it with sub/add (as those docs suggest). When I have the time, I will test whether this replacement really helps performance!

Cheers, Hannes

Snouphruh

OK.
When you have the results, please write back to me, OK?

PS: I have an AMD Athlon XP 2500+ @ 2244 MHz.

roticv

LOOP is a complex opcode. Intel's optimisation manual for the P4 recommends against using it.

Eóin

And slightly off topic, but don't forget the neat little trick for moving up through an array from offset 0 to the end, e.g.:

    mov ecx, -256*4
lp: mov eax, [Array+256*4+ecx] ; or something
    add ecx, 4
    js  lp
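
A fuller sketch of that trick, assuming a 256-dword table named Array as in the snippet. SumArray and the fill value of 1 are made up for illustration. The point is that the ADD which advances the index also sets the flags, so no separate CMP is needed.

```asm
; Sketch: sum 256 dwords using one negative index register.
; SumArray and the fill value are illustrative, not from the thread.
.data
Array   dd 256 dup (1)

.code
SumArray proc
    xor eax, eax                    ; running sum
    mov ecx, -256*4                 ; byte offset relative to the end of Array
lp:
    add eax, [Array+256*4+ecx]      ; first pass reads Array[0]
    add ecx, 4                      ; step to the next dword, updates SF/ZF
    js  lp                          ; loop while the offset is still negative
    ret                             ; sum returned in eax
SumArray endp
```

The loop exits when ecx reaches 0, i.e. one element past the last dword.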

a_h

OK, here are the results:

Using a small loop (stdcall) without memory accesses, like


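looptest0 proc
    mov ecx, 10000      ; loop counter
L1:
    mov ebx, 1          ; dummy work, no memory access (ebx is clobbered)
    add ebx, ecx
    sub ecx, 1          ; count down; sets ZF when ecx reaches 0
    jnz L1              ; repeat until ecx = 0
    ret
looptest0 endp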


yields the following results on a P4:

using jnz/sub (code above): ~30 000 cycles
using loop: 34 400 cycles
using jnz/dec: 42 250 cycles
using repeat: 43 800 cycles

All numbers are averaged over 4 runs (done 3 times to check).
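
The post does not say how the cycles were counted. One common approach is to read the time-stamp counter around the call, as in this minimal sketch. It assumes a .586 or later processor directive is in effect, and TimeLoop is a hypothetical name. The result includes the overhead of the call and of CPUID/RDTSC themselves, so only differences between variants are meaningful.

```asm
; Hypothetical timing wrapper around looptest0 (not from the thread).
; Requires .586 or later so that CPUID and RDTSC assemble.
TimeLoop proc uses ebx esi
    xor eax, eax
    cpuid                   ; serialize; clobbers eax, ebx, ecx, edx
    rdtsc                   ; edx:eax = time-stamp counter
    mov esi, eax            ; keep the low 32 bits of the start count
    call looptest0          ; code under test
    xor eax, eax
    cpuid                   ; serialize again before the second read
    rdtsc
    sub eax, esi            ; eax = elapsed cycles (low 32 bits only)
    ret
TimeLoop endp
```

Calling it several times and averaging the returned values would mirror the averaging over runs described here.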

Thanks for reading, Hannes