Optimisation pitfall?

Started by raymond, January 22, 2005, 05:23:12 AM


raymond

A few years ago, I devised the following code to extract the square root of 32-bit unsigned integers using only CPU instructions. It resulted in the sqrt32 function of my MixLib library of fixed-point math, uInt being the integer passed as the parameter.

    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
    mov ecx,16          ;********
sqr10:
    add eax,ebx
    cmp eax,edx
    ja  sqr20
    sub edx,eax
    add eax,ebx
    jnz sqr30
sqr20:
    sub eax,ebx
sqr30:
    shr eax,1
    shr ebx,2
    dec ecx             ;********
    jnz sqr10

    cmp eax,edx
    jae @F
    inc eax
  @@:   
    ret


I was recently having another critical look at the functions of that library when I noticed that the use of ECX as a counter in the above algo is not really necessary, because EBX becomes 0 after the 16 iterations. Removing the two instructions flagged with ;******** produced the same end results.
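For readers who prefer a higher-level view, the algorithm can be sketched in C (my own translation for illustration; the name isqrt32 is not from MixLib). It also makes the redundancy plain: the loop can stop either after 16 iterations or when the bit value reaches zero, since both happen at the same time.

```c
/* Bit-by-bit integer square root, rounded to nearest.
   EAX -> root, EBX -> bit, EDX -> n in the assembly version. */
static unsigned isqrt32(unsigned n)
{
    unsigned root = 0;
    unsigned bit  = 0x40000000u;   /* 4^15, the highest power-of-4 bit */

    while (bit != 0) {             /* exactly 16 iterations; mov ecx,16 / dec ecx works too */
        root += bit;
        if (root > n) {
            root -= bit;           /* this result bit is 0 */
        } else {
            n -= root;
            root += bit;           /* this result bit is 1 */
        }
        root >>= 1;                /* shr eax,1 */
        bit  >>= 2;                /* shr ebx,2 */
    }
    if (root < n)                  /* final cmp/jae/inc: round to nearest */
        root++;
    return root;
}
```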

Curious about any beneficial effect those fewer instructions would have on the time of execution, I ran some tests with 1,000,000 cycles on a number in the range of 100,000,000. Results on a P3-550:

With the counter: 260 ms
Without the counter: 410 ms !!!

Would anybody have an explanation for such an increase after removing a redundant instruction from the loop???

As a side note, unrolling the loop runs in 140 ms and using the FPU runs in 105 ms. The test included extracting the square root and storing the result in a memory variable.

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

I'll take a stab at it, though I haven't timed it to verify. At least one thing the counter does is break up a dependency in this section of code. The cut-and-paste below isn't sequential; it shows the instructions in the order they get executed in the loop.


sqr30:
    shr eax,1
    shr ebx,2
;    dec ecx             ;********
    jnz sqr10
sqr10:
    add eax,ebx       ;EBX was just updated by the SHR EBX,2, which causes a stall (read-after-write dependency penalty)
    cmp eax,edx



As a check, change the "dec ecx" to a "xor ecx,ecx". Both should be close to the same speed, and it should get rid of the dependency. As an additional possible speedup, try putting the XOR in between the two SHRs.

    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
;    mov ecx,16          ;********
sqr10:
    add eax,ebx
    cmp eax,edx
    ja  sqr20
    sub edx,eax
    add eax,ebx
    jnz sqr30
sqr20:
    sub eax,ebx
sqr30:
    shr eax,1
    shr ebx,2
    xor ecx,ecx             ;******** (note: XOR always sets ZF, so the JNZ below would fall through)
    jnz sqr10

    cmp eax,edx
    jae @F
    inc eax
  @@:   
    ret

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

raymond

I don't think that dependency is the cause for the following reasons:

a) Unrolling the loop puts the two instructions immediately following each other and the timing was significantly improved.
b) Replacing the shr ebx,2 instructions with 2 shr ebx,1 instructions brings the timing back down to the 280 ms level.
(Starting with EBX=80000000h and transferring one of the shr ebx,1 to the top of the loop improves it to the 260 ms level, removing one small dependency between the 2 consecutive shr ebx instructions.)

The conditional jump prediction remains identical whether the dec ecx or the shr is used.

From those tests, the problem seems to stem from the shifting by more than one bit followed by the conditional jump. Maybe Agner could figure out that one if he ever reads this thread.

Raymond

dioxin

Raymond,
I suspect that if you replace the 2 removed instructions with the appropriate number of NOPs, your timing will return to normal or improve slightly. If it does, then it points to an instruction alignment problem in the shorter code.

Also, the DEC ecx overwrites the flags; without it, the SHR ebx has to be executed before the jump can be taken. By including the DEC, you end up executing the loop overhead at the same time as the previous calculation instead of after it.

Paul.
PS: slightly related. For jumps like "jnz sqr10", does MASM always assemble the short form of the jump if it can, or does it default to the long form and need to be told to use the short form?

MichaelW

Replacing the removed instructions with an appropriate number of NOPs did not seem to restore the original timing. But after doing as Raymond described and substituting two shr ebx,1 instructions, moving one to the top of the loop, and initializing ebx to 80000000h, the overall timing was very close to the original. And after eliminating the add eax,ebx dependency by substituting an lea eax,[eax+ebx] further down and adjusting the location of the first shr ebx,1 instruction, the timing on a P3 improved by ~25% relative to the original. But even so it cannot come close to the FP version.



eschew obfuscation

raymond

The unrolled loop is still the fastest, although about 40% slower than the FPU version. Replacing the "add eax,ebx" with "lea eax,[eax+ebx]" had no visible effect on the unrolled version. Replacing the "shr ebx,2" with two "shr ebx,1" in the unrolled version was detrimental, even when separated by the "shl eax,1".

Raymond

bitRAKE

Might this have something to do with how the branch predictor works? I don't know about the P3, but the AMD chips have difficulty predicting three branches in eight bytes, iirc. MichaelW's comments would seem to indicate the code spacing is not the reason for the additional time, but a test case based on the function of the branch predictor would have to be devised to eliminate this explanation. Both the position of the branch code in memory and the distance between branches will impact code speed.

Mark_Larson

I ran all this on my P4. 

Quote
Original unmodified code - 109 cycles
Removing the loop counter ECX - 156 cycles
Unrolling the loop - 102 cycles
Using fsqrt - 27 cycles

On a P4 the unrolling came out slightly faster.  Here's the unrolled code.  I made one slight change where I do a "JNZ $+4", so that is why I wanted to post it.


isqrt2 proc uInt:DWORD
    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
    REPEAT 16
    add eax,ebx
    cmp eax,edx
    ja  @F
    sub edx,eax
    add eax,ebx
;    jnz sqr30
    jnz $+4             ;sub eax,ebx should be 2 bytes
  @@:
    sub eax,ebx
;sqr30:
    shr eax,1
    shr ebx,2
    ENDM

    cmp eax,edx
    jae @F
    inc eax
  @@:
    ret
isqrt2 endp


I am trying some variations to see if I can make it faster.
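As a side note for non-MASM readers, the effect of REPEAT 16 can be sketched in C with a step macro expanded once per bit value (my own illustration, not Mark's code; the always-taken jnz simply becomes straight-line control flow):

```c
/* One unrolled square-root step; BIT is a compile-time constant,
   exactly as in the REPEAT 16 macro expansion. */
#define SQRT_STEP(BIT)            \
    do {                          \
        root += (BIT);            \
        if (root > n) {           \
            root -= (BIT);        \
        } else {                  \
            n -= root;            \
            root += (BIT);        \
        }                         \
        root >>= 1;               \
    } while (0)

static unsigned isqrt32_unrolled(unsigned n)
{
    unsigned root = 0;
    /* the 16 bit values: 40000000h shifted right 2 at a time */
    SQRT_STEP(0x40000000u); SQRT_STEP(0x10000000u);
    SQRT_STEP(0x04000000u); SQRT_STEP(0x01000000u);
    SQRT_STEP(0x00400000u); SQRT_STEP(0x00100000u);
    SQRT_STEP(0x00040000u); SQRT_STEP(0x00010000u);
    SQRT_STEP(0x00004000u); SQRT_STEP(0x00001000u);
    SQRT_STEP(0x00000400u); SQRT_STEP(0x00000100u);
    SQRT_STEP(0x00000040u); SQRT_STEP(0x00000010u);
    SQRT_STEP(0x00000004u); SQRT_STEP(0x00000001u);
    if (root < n)
        root++;                   /* round to nearest, as in the asm epilogue */
    return root;
}
```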

MichaelW

Mark,

On a P3, the original code ran in 133 cycles, the best I could do without unrolling the loop in 97 cycles, and your version in 70 cycles (and fsqrt in a procedure with a local dword to store the result in 62 cycles).

Mark_Larson

Quote from: MichaelW on January 26, 2005, 02:10:25 AM
Mark,

On a P3, the original code ran in 133 cycles, the best I could do without unrolling the loop in 97 cycles, and your version in 70 cycles (and fsqrt in a procedure with a local dword to store the result in 62 cycles).


Within 8 cycles of a floating-point square root? Wow. Time to get creative and come up with a faster solution. I am working on one in my head, just haven't typed it up yet.

raymond

Mark

Your "jnz $+4 ;sub eax,ebx should be 2 bytes" is a conditional jump, which may slow the procedure. An unconditional jump may be preferable, since the jump does not need to be conditional.

Raymond


Mark_Larson

This came out the same speed on my P4, but it might be faster on a P3. It does a table lookup instead of the SHR. I have two variations; one can be used to break up dependencies. To try the other variant, uncomment the mov esi,-4 and the earlier add esi,4.


align 4
the_table dd 40000000h SHR 2
          dd 40000000h SHR 4
          dd 40000000h SHR 6
          dd 40000000h SHR 8
          dd 40000000h SHR 10
          dd 40000000h SHR 12
          dd 40000000h SHR 14
          dd 40000000h SHR 16    ;only 8 entries, but the loop below performs 16 loads




isqrt2 proc uInt:DWORD
    xor esi,esi
;    mov esi,-4
    mov edx,uInt
    xor eax,eax
    mov ebx,40000000h
    REPEAT 16
    add eax,ebx
;    add esi,4
    cmp eax,edx
    ja  @F
    sub edx,eax
    add eax,ebx
    jnz $+4             ;sub eax,ebx should be 2 bytes
  @@:
    sub eax,ebx
;JNZ $+4 lands here
    shr eax,1
    mov ebx,[the_table+esi]     ;replaces the SHR EBX,2
    add esi,4
    ENDM

    cmp eax,edx
    jae @F
    inc eax
  @@:
    ret
isqrt2 endp


Raymond, I changed the code to an unconditional jump. It was only 1 cycle faster on my P4. Good prediction logic.

MichaelW

It is faster on a P3, 68 cycles as posted :U, but 76 for the other variant.

I changed the fsqrt procedure to use aligned static data to store the result, same 62 cycles??

And on my 97-cycle version I made a stupid mistake that altered the calculated value. So my best without unrolling the loop was slower than the original.


Jimg

Mark-

This last one is not giving the correct answers.  Try sqrt of 2147483647.  Gives 46337 rather than 46341.  (And takes three clock cycles longer than the previous version without the shift table on my Athlon).

Mark_Larson

Quote from: Jimg on January 26, 2005, 02:29:54 PM
Mark-

This last one is not giving the correct answers.  Try sqrt of 2147483647.  Gives 46337 rather than 46341.  (And takes three clock cycles longer than the previous version without the shift table on my Athlon).

Thanks. I don't actually care, because it wasn't any faster than the version I had been working on. I simply posted it because it shows a method of getting rid of a SHR. SHRs are slow on a P4. The L1 latency for memory accesses is 2 cycles; a SHR is 4 or 5 cycles (I forget the exact number off the top of my head). There are just other dependencies that keep it from being 2 or 3 cycles faster.
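For what it's worth, the wrong answers Jimg reported look consistent with the table length: the REPEAT 16 loop performs 16 loads, but the_table defines only 8 entries, so the later iterations pick up whatever follows in memory. Here is a C sketch of the table-lookup idea with a full-length table (my reconstruction for illustration, not a tested fix of the asm above):

```c
/* Bit values for iterations 2..16; the final entry is loaded
   on the last iteration but never used. */
static const unsigned bit_table[16] = {
    0x40000000u >> 2,  0x40000000u >> 4,  0x40000000u >> 6,  0x40000000u >> 8,
    0x40000000u >> 10, 0x40000000u >> 12, 0x40000000u >> 14, 0x40000000u >> 16,
    0x40000000u >> 18, 0x40000000u >> 20, 0x40000000u >> 22, 0x40000000u >> 24,
    0x40000000u >> 26, 0x40000000u >> 28, 0x40000000u >> 30, 0
};

static unsigned isqrt32_table(unsigned n)
{
    unsigned root = 0;
    unsigned bit  = 0x40000000u;
    int i;
    for (i = 0; i < 16; i++) {
        root += bit;
        if (root > n) {
            root -= bit;
        } else {
            n -= root;
            root += bit;
        }
        root >>= 1;
        bit = bit_table[i];       /* replaces shr ebx,2 with a load */
    }
    if (root < n)
        root++;                   /* round to nearest */
    return root;
}
```

With the full table the routine again returns 46341 for 2147483647, the value Jimg expected.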
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm