loop instruction

Started by zemtex, December 22, 2011, 01:24:45 PM

zemtex

What is your opinion on:

loop

vs

sub, Jcc

In the Intel optimization manual, loop has a latency of 8 cycles and a throughput of 1.5 cycles.

sub has a latency of 1 and throughput of 0.5 cycles
Jcc has throughput of 0.5 cycles

Which means that

sub + Jcc have a combined throughput of only 1 cycle, but since Jcc depends on the flags that sub sets, the pair will add up to at least 2 cycles.
loop has a throughput of 1.5 cycles.

Going by throughput alone, a loop built on the loop instruction has an overhead that is 75% of that of sub + Jcc (1.5 vs. 2 cycles). In addition, the loop instruction depends on ecx, so I'm not sure how efficient it is. But loop instructions are not executed back to back, so the latency will apply on every iteration. Correct me if I'm wrong.

So that is 8 cycles of overhead for the loop instruction and
2 cycles for sub/Jcc

(assuming the loop body is over 8 cycles long)
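
For reference, the two loop forms being compared can be sketched as below. This is a minimal sketch, not a benchmark, and it assumes the MASM32 SDK's masm32rt.inc is installed at its default path. The iteration count of 1000 and the labels loop_a and loop_b are hypothetical, and the inc instructions merely stand in for a real loop body.

```asm
; Sketch only: shows the two loop forms side by side, not a timing test.
; Assumes the MASM32 SDK runtime include at its default location.
include \masm32\include\masm32rt.inc

.code
start:
    ; Form 1: LOOP decrements ecx and jumps while ecx is nonzero.
    ; It does not modify the flags.
    mov ecx, 1000           ; hypothetical iteration count
    xor eax, eax
loop_a:
    inc eax                 ; stand-in for the loop body
    loop loop_a             ; ecx = ecx - 1, jump if ecx != 0

    ; Form 2: explicit decrement plus conditional jump.
    mov ecx, 1000           ; same hypothetical iteration count
    xor edx, edx
loop_b:
    inc edx                 ; stand-in for the loop body
    sub ecx, 1              ; sets ZF when ecx reaches 0
    jnz loop_b              ; jump while ecx != 0

    ; Both forms run the body 1000 times, so eax equals edx here.
    invoke ExitProcess, 0
end start
```

Note that loop leaves the flags untouched while sub overwrites them, so the two forms are only interchangeable when the flags set inside the body are not needed after the branch.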
I have been puzzling with Lego bricks all my life. I know how to do this. When Peter, at age 6, is competing with me, I find it extremely necessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

LOOP is slow - but still handy when speed isn't critical
the same is true for JECXZ

zemtex

Take these instructions:

mov eax, ecx     
mov edx, ebx
mov esi, edi
mov ebp, esp

This should finish in 2.5 cycles.

the funny thing is that these:

mov eax, ecx
mov eax, edx

will finish in 2 cycles.

jj2007

Test your luck... for my puter it's 4 cycles slower.

zemtex

If I am not mistaken, I used 06_2FH, which is the code for Sandy Bridge, so the timings are probably a bit different on yours. Or perhaps I miscalculated? In any case, the optimization manual states that the timings are not guaranteed in real-world code.

zemtex

11033 loop
7961 jnz

11047 loop
7838 jnz

jj2007

I had randomly thrown together some instructions to add up to the 8 cycles, but maybe there is a combination that makes loop faster than jnz...

EatCycles MACRO
  push eax
  push edx
  xchg eax, edx
  inc eax
  dec edx
  nops 5
  pop edx
  xchg eax, ebx
  mov ebx, edx
  sub ebx, ecx
  pop eax
  sub edx, eax
  add eax, edx
ENDM

dedndave

mov eax, ecx
mov eax, edx


some processors may foresee that the first instruction need not be executed

KeepingRealBusy

Quote from: dedndave on December 22, 2011, 06:20:57 PM
mov eax, ecx
mov eax, edx


some processors may foresee that the first instruction need not be executed

And others will see it as a stall (both target eax).

Dave.