The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: a_h on September 17, 2005, 05:55:40 PM

Title: loop speed question
Post by: a_h on September 17, 2005, 05:55:40 PM
Hi!

Which loop implementation do you consider to be faster on the P4? (I have no time to try it myself.)


.xxx
...
sub eax, 1
jnz .xxx


or (I would prefer this version, with the counter in ecx instead of eax)


.repeat
...
.untilcxz


Thanks for your input!

Have a nice evening, Hannes
Title: Re: loop speed question
Post by: roticv on September 19, 2005, 06:25:40 AM
Does .untilcxz use jecxz? If it does, it is better to use the former.
Title: Re: loop speed question
Post by: MichaelW on September 19, 2005, 07:01:34 AM
.UNTILCXZ uses the LOOP instruction. On a P3, SUB/JNZ is almost 3x faster than .REPEAT/.UNTILCXZ.
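
A rough sketch of the two forms side by side, in case it helps. The proc names, the loop body, and the count of 10 are made up for illustration. Both procs should return the same sum in eax; the difference is only in how the loop is closed (LOOP leaves the flags alone, SUB sets them).

```asm
.386
.model flat, stdcall
.code

; Hypothetical example: sum 10 + 9 + ... + 1 in eax.
; .UNTILCXZ closes the block with a LOOP instruction,
; which decrements ecx and jumps back while ecx != 0.
SumWithUntilcxz proc
        xor     eax, eax        ; running sum
        mov     ecx, 10         ; loop counter
        .repeat
            add eax, ecx
        .untilcxz
        ret
SumWithUntilcxz endp

; The same loop closed by hand with SUB/JNZ.
SumWithSubJnz proc
        xor     eax, eax        ; running sum
        mov     ecx, 10         ; loop counter
again:
        add     eax, ecx
        sub     ecx, 1          ; sets ZF when the counter reaches 0
        jnz     again
        ret
SumWithSubJnz endp

end
```

Note that both versions assume ecx is nonzero on entry; starting from 0 the counter wraps around instead of skipping the loop.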


Title: Re: loop speed question
Post by: Snouphruh on September 19, 2005, 07:08:52 AM
I recommend using the DEC and JNZ instructions.
Example:

                    mov ecx, <nCount>
myLoop:
                    ...
                    dec ecx
                    jnz myLoop


Try not to use this:

                    mov ecx, <nCount>
myLoop:
                    ...
                    loop myLoop

because the LOOP instruction takes more CPU time than DEC + JNZ. DEC + JNZ are paired, i.e. these 2 instructions take 1 CPU clock (LOOP takes 3).
Title: Re: loop speed question
Post by: a_h on September 19, 2005, 07:34:30 AM
Thanks for your replies! :U

Actually, I thought MASM's .repeat/.untilcxz would expand to dec/jnz, but I was wrong!

@Snouphruh: according to Intel's manuals and Agner Fog's PDF, inc/dec is not optimal for the P4; that's the reason I substitute it with sub/add (as suggested by these docs). When I have the time, I will test whether this replacement is really any good for performance!

Cheers, Hannes
Title: Re: loop speed question
Post by: Snouphruh on September 19, 2005, 11:48:09 AM
ok.
When you get the results, write back to me, please. OK?

ps: I have AMD Athlon XP 2500+ @ 2244 MHz.
Title: Re: loop speed question
Post by: roticv on September 19, 2005, 02:49:35 PM
loop is a complex opcode. Intel's optimisation manual for the P4 recommends not using it.
Title: Re: loop speed question
Post by: Eóin on September 19, 2005, 07:06:46 PM
And slightly off topic, but don't forget the neat little trick for moving up through an array from offset 0 to the end, e.g.:
mov ecx,-256*4
lp: mov eax,[Array+256*4+ecx] ; or something

add ecx,4
js lp
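
To spell the trick out a bit: the index starts at minus the array size in bytes and counts up towards zero, so the ADD that advances it also sets the flags for the loop test and no separate CMP is needed. A minimal sketch, assuming a 256-dword array and a hypothetical proc name SumArray (neither is from the snippet above beyond the 256*4 size):

```asm
.386
.model flat, stdcall

.data
Array   dd 256 dup(1)                   ; 256 dwords, all set to 1 (illustrative data)

.code

; Hypothetical example: returns the sum of Array in eax.
SumArray proc
        xor     eax, eax                ; running sum
        mov     ecx, -256*4             ; negative byte offset from the end of the array
lp:
        add     eax, [Array + 256*4 + ecx]  ; end of array + negative offset = current element
        add     ecx, 4                  ; next dword; the result's sign drives the branch
        js      lp                      ; keep going while the offset is still negative
        ret
SumArray endp

end
```

The first pass reads Array+0 and the last one reads Array+255*4; once ecx reaches 0 the sign flag is clear and the loop falls through.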
Title: Re: loop speed question
Post by: a_h on September 21, 2005, 02:31:13 PM
OK, here are the results:

Using a small loop (stdcall) without memory accesses, like


looptest0 proc
    mov ecx, 10000      ; loop counter
L1:
    mov ebx, 1          ; dummy work, no memory access
    add ebx, ecx
    sub ecx, 1          ; decrement the counter, sets ZF at zero
    jnz L1
    ret
looptest0 endp


yields the following results on a P4:

using jnz/sub (code above): ~30 000 cycles
using loop: 34 400 cycles
using jnz/dec: 42 250 cycles
using repeat: 43 800 cycles

All numbers are averaged over 4 runs (done 3 times to check).
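
For anyone who wants to repeat this: one common way to get cycle counts like these is to read the time stamp counter with RDTSC before and after the call. The sketch below is only an assumption about how such a measurement could be set up, not a description of the harness used for the numbers above. TimeLoop0 is a hypothetical name, and the result includes the overhead of the call and of CPUID/RDTSC themselves.

```asm
.586                                    ; RDTSC and CPUID need at least .586
.model flat, stdcall
.code

; Loop under test, same as above (note that it overwrites ebx and ecx).
looptest0 proc
        mov     ecx, 10000
L1:
        mov     ebx, 1
        add     ebx, ecx
        sub     ecx, 1
        jnz     L1
        ret
looptest0 endp

; Hypothetical timing wrapper: returns the elapsed cycle count in edx:eax.
TimeLoop0 proc uses ebx esi edi
        xor     eax, eax
        cpuid                           ; serialize before reading the counter
        rdtsc                           ; edx:eax = start count
        mov     esi, eax                ; keep the start count in registers
        mov     edi, edx                ; that looptest0 does not touch
        call    looptest0
        xor     eax, eax
        cpuid                           ; serialize again
        rdtsc                           ; edx:eax = end count
        sub     eax, esi                ; 64-bit subtract: end - start
        sbb     edx, edi
        ret
TimeLoop0 endp

end
```

Calling it several times and averaging, as done above, helps to smooth out cache and interrupt noise.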

Thanks for reading, Hannes