loop speed question

a_h · September 17, 2005, 05:55:40 PM

Hi!

Which loop implementation do you consider faster on the P4? (I have no time to try it myself.)


.xxx
...
sub eax, 1
jnz .xxx

or (I would prefer this version, with the counter in ecx instead of eax)

.repeat
...
.untilcxz

Thanks for your input!

Have a nice evening, Hannes

roticv · September 19, 2005, 06:25:40 AM

Does .untilcxz use jecxz? If it does, it is better to use the former.

MichaelW · September 19, 2005, 07:01:34 AM

.UNTILCXZ uses the LOOP instruction. On a P3, SUB/JNZ is almost 3x faster than .REPEAT/.UNTILCXZ.

Snouphruh · September 19, 2005, 07:08:52 AM

I recommend using the DEC and JNZ instructions.
Example:

                    mov ecx, <nCount>
myLoop:
                    ...
                    dec ecx
                    jnz myLoop

Try not to use this:

                    mov ecx, <nCount>
myLoop:
                    ...
                    loop myLoop

This is because the LOOP instruction takes more CPU time than DEC + JNZ. DEC and JNZ are paired, i.e. the two instructions together take 1 CPU clock (LOOP takes 3).

a_h · September 19, 2005, 07:34:30 AM

Thanks for your replies! :U

Actually, I thought the MASM .repeat/.untilcxz macro would expand to dec/jnz, but I was wrong!

@Snouphruh: according to Intel's manuals and Agner Fog's PDF, inc/dec is not optimal on the P4; that's why I replace it with sub/add (as suggested by these docs). When I have the time, I will test whether this replacement really improves performance!

Cheers, Hannes

Snouphruh · September 19, 2005, 11:48:09 AM

OK.
When you get the results, please write back to me, OK?

PS: I have an AMD Athlon XP 2500+ @ 2244 MHz.

roticv · September 19, 2005, 02:49:35 PM

LOOP is a complex opcode. The Intel optimisation manual for the P4 recommends not using it.

Eóin · September 19, 2005, 07:06:46 PM

And slightly off topic, but don't forget the neat little trick for moving up through an array from offset 0 to the end, e.g.:

mov ecx,-256*4
lp: mov eax,[Array+256*4+ecx] ; or something

add ecx,4
js lp

a_h · September 21, 2005, 02:31:13 PM

OK, here are the results:

Using a small loop (stdcall) without memory accesses like

looptest0 proc
	mov		ecx, 10000
L1:
	mov		ebx, 1
	add		ebx, ecx
	sub		ecx,1
	jnz		L1
	ret
looptest0 endp

yields the following results on a P4:

using jnz/sub (code above): ~30 000 cycles
using loop: 34 400 cycles
using jnz/dec: 42 250 cycles
using repeat: 43 800 cycles

All numbers are averaged over 4 runs (done 3 times to check).

Thanks for reading, Hannes
