The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: raymond on January 22, 2005, 05:23:12 AM

Title: Optimisation pitfall?
Post by: raymond on January 22, 2005, 05:23:12 AM
A few years ago, I devised the following code to extract the square root of 32-bit unsigned integers using only CPU instructions. It resulted in the sqrt32 function of my MixLib library of fixed-point math, uInt being the integer passed as the parameter.

    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
    mov ecx,16          ;********
sqr10:
    add eax,ebx
    cmp eax,edx
    ja  sqr20
    sub edx,eax
    add eax,ebx
    jnz sqr30
sqr20:
    sub eax,ebx
sqr30:
    shr eax,1
    shr ebx,2
    dec ecx             ;********
    jnz sqr10

    cmp eax,edx
    jae @F
    inc eax
  @@:   
    ret


I was recently having another critical look at the functions of that library when I noticed that the use of ECX as a counter in the above algo is not really necessary because EBX becomes 0 after the 16 iterations. Removing the two instructions commented with *** produced the same end results.

Curious about any beneficial effect those fewer instructions would have on the time of execution, I ran some tests with 1,000,000 cycles on a number in the range of 100,000,000. Results on a P3-550:

With the counter: 260 ms
Without the counter: 410 ms !!!!!!!!!!!!!!!!!

Would anybody have an explanation for such an increase after removing a single redundant instructions in the loop???

As a side note, unrolling the loop runs in 140 ms and using the FPU runs in 105 ms. The test included extracting the square root and storing the result in a memory variable.

Raymond
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 22, 2005, 05:49:30 AM
I'll take a stab at it.  But I am not timing it to verify.  At least one of the things you are doing is breaking up a dependency.  In this section of code.  This cut and paste isn't sequential.  It is some instructions in the order they get executed in a loop.


sqr30:
    shr eax,1
    shr ebx,2
;    dec ecx             ;********
    jnz sqr10
sqr10:
    add eax,ebx       ;'ebx just got updated by the SHR EBX,2.  That is a causes a stall. ( read after write dependency penalty)
    cmp eax,edx



As a check.  Change the "dec ecx", to a "xor ecx,ecx".  Both should be close to the same speed, and should get rid of the depedencies.  As an additonal possible speed up try checking putting the XOR in between the two SHRs. 

    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
;    mov ecx,16          ;********
sqr10:
    add eax,ebx
    cmp eax,edx
    ja  sqr20
    sub edx,eax
    add eax,ebx
    jnz sqr30
sqr20:
    sub eax,ebx
sqr30:
    shr eax,1
    shr ebx,2
    xor ecx,ecx             ;********
    jnz sqr10

    cmp eax,edx
    jae @F
    inc eax
  @@:   
    ret

Title: Re: Optimisation pitfall?
Post by: raymond on January 22, 2005, 08:59:25 PM
I don't think that dependency is the cause for the following reasons:

a) Unrolling the loop puts the two instructions immediately following each other and the timing was significantly improved.
b) Replacing the shr ebx,2 instructions with 2 shr ebx,1 instructions brings the timing back down to the 280 ms level.
(Starting with EBX=80000000h and transferring one of the shr ebx,1 to the top of the loop inproves it to the 260 ms level, removing one small dependency between the 2 consecutive shr ebx instructions.)

The conditional jump prediction remains identical whether the dec ecx or the shr is used.

From those tests, the problem seems to stem from the shifting by more than one bit followed by the conditional jump. Maybe Agner could figure out that one if he ever reads this thread.

Raymond
Title: Re: Optimisation pitfall?
Post by: dioxin on January 22, 2005, 10:16:41 PM
Raymond,
I suspect that if you replace the 2 removed instructions with the appropriate number of NOPs that your timing will return to normal or improve slightly. If it does then it points to an instruction alignment problem in the shorter code.

Also, the DEC ecx overwrites the flags, without it the SHR ebx has to be executed before the jump can be taken. By including the DEC you end up executing the loop overhead at the same time as the previous calculation instead of after it.

Paul.
PS slightly related. For jumps like"jnz sqr10" does MASM always assemble the short form of the jump if it can or does it default to the long form and need to be told to use the short form?
Title: Re: Optimisation pitfall?
Post by: MichaelW on January 23, 2005, 01:26:45 AM
Replacing the removed instructions with an appropriate number of NOPs did not seem to restore the original timing. But after doing as Raymond described and substituting two shr ebx,1 instructions, moving one to the top of the loop, and initializing ebx to 80000000h, the overall timing was very close to the original. And after eliminating the add eax,ebx dependency by substituting an lea eax,[eax+ebx] further down and adjusting the locations of the first shr ebx,1 instruction, the timing on a P3 improved by ~25% relative to the original. But even so it cannot come close to the FP version.



[attachment deleted by admin]
Title: Re: Optimisation pitfall?
Post by: raymond on January 23, 2005, 05:22:52 AM
The unrolled loop is still the fastest, although about 40% slower than the FPU version. Replacing the "add eax,ebx" with "lea eax,[eax+ebx]" had no visible effect on the unrolled version. Replacing the "shr ebx,2" with two "shr ebx,1" in the unrolled version was detrimental, even when separated by the "shl eax,1".

Raymond
Title: Re: Optimisation pitfall?
Post by: bitRAKE on January 23, 2005, 07:36:49 AM
Might this have something to do with how the branch predictor works? I don't know about the P3, but the AMD chips have difficulty predicting three branches in eight bytes, iirc. MichaelW's comments would seem to indicate the code spacing is not the reason for the additional time, but a test case based on the function of the branch predictor would have to be devised to elimnate this explaination. Both the position of the branch code in memory and the distance between branches will impact code speed.
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 12:22:10 AM
I ran all this on my P4. 

Quote
Original unmodified code                  - 109 cycles
removing the loop counter ECX code - 156 cycles
unrolling the loop                            - 102 cycles
using fsqrt                                     - 27 cycles

On a P4 the unrolling came out slightly faster.  Here's the unrolled code.  I made one slight change where I do a "JNZ $+4", so that is why I wanted to post it.


isqrt2 proc uInt:DWORD
mov edx,uInt
xor eax,eax
mov ebx,40000000h
REPEAT 16
add eax,ebx
cmp eax,edx
ja @F
sub edx,eax
add eax,ebx
; jnz sqr30
jnz $+4 ;sub eax,ebx should be 2 bytes
@@:
sub eax,ebx
;sqr30:
shr eax,1
shr ebx,2
endm

cmp eax,edx
jae @F
inc eax
@@:
ret
isqrt2 endp


I am trying some variations to see if I can make it faster.
Title: Re: Optimisation pitfall?
Post by: MichaelW on January 26, 2005, 02:10:25 AM
Mark,

On a P3, the original code ran in 133 cycles, the best I could do without unrolling the loop in 97 cycles, and your version in 70 cycles (and fsqrt in a procedure with a local dword to store the result in 62 cycles).
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 02:44:16 AM
Quote from: MichaelW on January 26, 2005, 02:10:25 AM
Mark,

On a P3, the original code ran in 133 cycles, the best I could do without unrolling the loop in 97 cycles, and your version in 70 cycles (and fsqrt in a procedure with a local dword to store the result in 62 cycles).


8 cycles from a floating square root? wow  Time to get creative and come up with a faster solution.  I am working on one in my head ,just haven't typed it up yet.
Title: Re: Optimisation pitfall?
Post by: raymond on January 26, 2005, 04:33:20 AM
Mark

Your jnz $+4     ;sub eax,ebx should be 2 bytes is a conditional jump which may slow the procedure. An absolute jump may be preferable since the jump does not need to be conditional.

Raymond

Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 04:40:14 AM
This came out the same speed on my P4, but it might be faster on P3.  Doing a table look up for the SHR.  I have two variations.  One can be used to break up dependencies.  Uncomment the mov esi,-4, and the add esi,4 that is earlier to try the other variant.


align 4
the_table dd 40000000h SHR 2
dd 40000000h SHR 4
dd 40000000h SHR 6
dd 40000000h SHR 8
dd 40000000h SHR 10
dd 40000000h SHR 12
dd 40000000h SHR 14
dd 40000000h SHR 16




isqrt2 proc uInt:DWORD
xor esi,esi
;mov esi,-4
mov edx,uInt
xor eax,eax
mov ebx,40000000h
REPEAT 16
add eax,ebx
;add esi,4
cmp eax,edx
ja @F
sub edx,eax
add eax,ebx
jnz $+4 ;sub eax,ebx should be 2 bytes
@@:
sub eax,ebx
;JNZ $+4 lands here
shr eax,1
mov ebx,[the_table+esi]    ;replaces the SHR EBX,2
add esi,4

endm

cmp eax,edx
jae @F
inc eax
@@:
ret
isqrt2 endp


Raymond, I changed the code to a unconditional jump.  Was only 1 cycle faster on my P4.  Good prediction logic.
Title: Re: Optimisation pitfall?
Post by: MichaelW on January 26, 2005, 05:00:55 AM
It is faster on a P3, 68 cycles as posted :U, but 76 for the other variant.

I changed the fsqrt procedure to use aligned static data to store the result, same 62 cycles??

And on my 97-cycle version I did a stupid that altered the calculated value. So my best without unrolling the loop was slower than the original.

Title: Re: Optimisation pitfall?
Post by: Jimg on January 26, 2005, 02:29:54 PM
Mark-

This last one is not giving the correct answers.  Try sqrt of 2147483647.  Gives 46337 rather than 46341.  (And takes three clock cycles longer than the previous version without the shift table on my Athlon).
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 04:00:07 PM
Quote from: Jimg on January 26, 2005, 02:29:54 PM
Mark-

This last one is not giving the correct answers.  Try sqrt of 2147483647.  Gives 46337 rather than 46341.  (And takes three clock cycles longer than the previous version without the shift table on my Athlon).

Thanks.  I don't actually care because it wasn't any faster than the version I had been working on.  I simply posted it, because it shows a method of getting rid of a SHR.  SHRs are slow on a P4.  The L1 latency for memory acceses is 2 cycles.  A SHR is 4 or 5 cycles ( I forget the exact # off the top of my head).  There are just other dependencies that keep it from being 2 or 3 cycles faster.
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 04:18:54 PM
Ok.  I macroed it.  Won't help you for non-runtime known values.  But it runs in 53 cycles on my P3 at work.  There are still lots of things I can do to tweak it.  This is just a first pass of playing with the code.


SQRT_MAC macro uInt:req
mov edx,uInt
xor eax,eax
mov ebx,40000000h

EBX_REG = 40000000h
EAX_REG = 0
EDX_REG = uInt
REPEAT 16
add eax,ebx ;first run we can skip the add, and do a mov eax,ebx, since eax is ZERO
EAX_REG = EAX_REG + EBX_REG
; cmp eax,edx
; ja @F
if EAX_REG LE EDX_REG
sub edx,eax
add eax,ebx
EDX_REG = EDX_REG - EAX_REG
EAX_REG = EAX_REG + EBX_REG
jmp $+4 ;sub eax,ebx should be 2 bytes
endif ; if EAX_REG LE EDX_REG
;@@:
sub eax,ebx
EAX_REG = EAX_REG - EBX_REG
;sqr30:
shr eax,1
shr ebx,2
EAX_REG = EAX_REG / 2
EBX_REG = EBX_REG / 4
endm

cmp eax,edx
jae @F
inc eax
@@:
; ret
endm
Title: Re: Optimisation pitfall?
Post by: Mark_Larson on January 26, 2005, 04:38:43 PM
I made a slight modificatoin to the code.  It now runs in 50 cycles on my P3.


SQRT_MAC macro uInt:req
mov edx,uInt
xor eax,eax
mov ebx,40000000h

EBX_REG = 40000000h
EAX_REG = 0
EDX_REG = uInt
REPEAT 16

if EAX_REG EQ 0
mov eax,ebx ;first run we can skip the add, and do a mov eax,ebx, since eax is ZERO
else
add eax,ebx ;first run we can skip the add, and do a mov eax,ebx, since eax is ZERO
endif ; if EAX_REG EQ 0

EAX_REG = EAX_REG + EBX_REG
; cmp eax,edx
; ja @F
if EAX_REG LE EDX_REG
sub edx,eax
add eax,ebx
EDX_REG = EDX_REG - EAX_REG
EAX_REG = EAX_REG + EBX_REG
jmp $+4 ;sub eax,ebx should be 2 bytes
endif ; if EAX_REG LE EDX_REG
;@@:
sub eax,ebx
EAX_REG = EAX_REG - EBX_REG
;sqr30:
shr eax,1
shr ebx,2
EAX_REG = EAX_REG / 2
EBX_REG = EBX_REG / 4
endm

cmp eax,edx
jae @F
inc eax
@@:
; ret
endm