I am trying to develop some speedy FPU macros. Out of curiosity, I benchmarked them against a trusty old 16-bit BASIC; here is my test suite:
CrLf$=CHR$(13, 10), gRes=0
G1=12.34, G2=543.21, f1=11.11
Ti%=GetTickCount(), L%=1000000
FOR ebx=0 TO L%
gRes=gRes+5.5-G2*(f1+380.78*G1)+G2*f1+ebx*1.44444
gRes=gRes+5.5-G2*(f1+(34.5+G1))+G2*f1+ebx*1.503
gRes=gRes+5.5-G2*(f1+(3456.543/G1))+G2*f1+ebx*1.5
gRes=gRes+5.5-G2*(f1+34.5*G1)-G2*f1+ebx*1.5
NEXT ebx
Ti%=GetTickCount()-Ti%
~MessageBox(0,"It took me a while for"+CrLf$+STR$(L%)+" loops: "+STR$(Ti%)+" ms", "Benchmark:", MB_OK)
Now these 1 million loops take slightly less than 2 seconds. The same in assembler: slightly less than 2 seconds. So far, so good - it seems 16-bit code already knew how to access the FPU.
What I forgot to mention, though, is that these 2 seconds refer to interpreted code; the same code compiled runs in about 200 ms... ten times faster than my pretty slim assembler routines. Here is a snippet of my macro output - it could be optimised here and there (see the short sketch right after the listing), but a factor of 10???
9B wait
DBE3 finit
6A 04 push 4
DB0424 fild dword ptr [esp]
58 pop eax
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
6A 05 push 5
DB0424 fild dword ptr [esp]
58 pop eax
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
DB2D 5E204000 fld tbyte ptr [40205E]
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
DB6D CC fld tbyte ptr [ebp-34]
DDD4 fst st(4)
DDC0 ffree st
D9F7 fincstp
6A 22 push 22
DB0424 fild dword ptr [esp]
58 pop eax
DDD5 fst st(5)
DDC0 ffree st
D9F7 fincstp
6A 03 push 3
DB0424 fild dword ptr [esp]
58 pop eax
D8FD fdivr st, st(5)
DDD5 fst st(5)
DDC0 ffree st
D9F7 fincstp
D9C4 fld st(4)
D8C4 fadd st, st(4)
DDD4 fst st(4)
DDC0 ffree st
D9F7 fincstp
D9C3 fld st(3)
D8CB fmul st, st(3)
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
D9C2 fld st(2)
D8EA fsubr st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
DB2D 5E204000 fld tbyte ptr [40205E]
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
DB6D CC fld tbyte ptr [ebp-34]
D8CB fmul st, st(3)
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
D9C2 fld st(2)
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
68 D0070000 push 7D0
DB0424 fild dword ptr [esp]
58 pop eax
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
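Just to illustrate what I mean by "optimised here and there": every operand above goes through push/fild/pop plus fst/ffree/fincstp. A compact hand-written version of just the 4+5 part at the top could look like this (a rough sketch; four and five are hypothetical dword variables kept in .data instead of being pushed):
.data
four dd 4
five dd 5
.code
fild four ; st(0) = 4
fiadd five ; st(0) = 9 - one memory-operand instruction instead of push/fild/pop/fst/ffree/fincstp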
Trying to solve the mystery, I disassembled the BASIC binary and found code like this:
DD06B40D fld qword [0xdb4]
DC064000 fadd qword [0x40]
DC0EAC0D fmul qword [0xdac]
DC2E3000 fsubr qword [0x30]
DD06AC0D fld qword [0xdac]
DC0EB40D fmul qword [0xdb4]
DEC1 faddp st1
DC063800 fadd qword [0x38]
83EC08 sub sp,byte +0x8
DD5EEC fstp qword [bp-0x14]
Now that doesn't look dramatically different from my own snippets, so I am still chasing the factor of 10.
I am a bloody beginner in this, so I hope somebody with more experience can give me a hint on how to solve the mystery...
If I take the snippet you posted and substitute a few variables to make it assemble, I get a cycle count of 130 (or 105 if I remove the finit, or 103 if I also remove the leading wait). Hopefully someone will correct me if I'm wrong here, but I doubt that this is due to overlap, where the CPU is effectively reading the cycle count before the FPU finishes, because placing a wait at the end of the FPU code does not change the count. I have not timed much FPU code, but for 72 FPU instructions 130 cycles seems at least reasonable to me. And assuming that the code is most or all of your loop code, on my relatively slow processor 1000000 loops would account for only about 260ms. Are you sure that your timings are accurate and that this code is the bottleneck?
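To spell out the arithmetic behind that estimate (assuming the ~130 cycles per iteration measured above covers the whole loop body):
130 cycles/iteration * 1,000,000 iterations = 130,000,000 cycles
130,000,000 cycles / 500,000,000 cycles per second ≈ 260 ms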
Sure there is room for some improvement, Michael; but here is another test:
I guess you all know the FpuLib and Raymond's example. I coded it in Masm (see Fpu help) and Basic:
L%=1000000
P=92500
R=6.75
N=20
FOR ebx=0 TO L%
PMT=P*R/1200*(1+R/1200)^(N*12)/((1+R/1200)^(N*12)-1)
NEXT ebx
That runs in slightly less than a second. Raymond's example in 32-bit assembler takes 6.3 seconds... a factor of 7. Does that make sense?
Here is the complete Basic code - the relevant bits are delimited by GetTickCount, so it should be possible to find them. I attach the exe and the disassembly, plus NDISASM in case you don't have it. Below the code there is also a quick by-hand check of the expected value.
P=92500
R=6.75
N=20
Ti%=GetTickCount()
FOR ebx=0 TO 1000000
PMT=P*R/1200*(1+R/1200)^(N*12)/((1+R/1200)^(N*12)-1)
NEXT ebx
Ti2%=GetTickCount()
Ti%=Ti2%-Ti%
~MessageBox(0,"It took "+STR$(Ti%)+" ms for "+CHR$(13,10)+"1000000 loops to find"+CHR$(13,10)+"the number "+STR$(PMT), "Benchmark:", MB_OK)
[attachment deleted by admin]
Quote
That runs in slightly less than a second. Raymond's example in 32-bit assembler takes 6.3 seconds... a factor of 7. Does that make sense?
No, it doesn't make sense. If I extract the essential parts of Raymond's example and modify the code somewhat so that it produces the same result as your code and does the rate and month calculations that are in your code, then on my 500 MHz P3 a million iterations takes ~630ms. The EXE from your attachment takes ~2200ms, and a FreeBASIC version of your code takes ~1900ms.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
PMT dq 0
R dq 6.75
P dd 92500
N dd 20
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke GetTickCount
push eax
xor ebx, ebx
.WHILE ebx < 1000000
fld R ; st(0)=yearly rate
fld8 1200.0 ; st(1)=1200, st(0)=yearly rate
fdiv ; st(0)=monthly rate
fild N ; st(0)=years, st(1)=monthly rate
fld8 12.0 ; st(0)=12, st(1)=years, st(2)=monthly rate
fmul ; st(0)=months, st(1)=monthly rate
fld st(1) ;-> st(0)=R, st(1)=months, st(2)=R
fyl2xp1 ;-> st(0)=log2(1+R)*months, st(1)=R
fld st ;-> st(0)=log, st(1)=log, st(2)=R
frndint ;-> st(0)=int(log), st(1)=log, st(2)=R
fsub st(1),st ;-> st(0)=int(log), st(1)=log-int(log), st(2)=R
fxch st(1) ;-> st(0)=log-int(log), st(1)=int(log), st(2)=R
f2xm1 ;-> st(0)=2^[log-int(log)]-1, st(1)=int(log), st(2)=R
fld1
fadd ;-> st(0)=2^[log-int(log)], st(1)=int(log), st(2)=R
fscale ;-> st(0)=(1+R)^N, st(1)=int(log), st(2)=R
fstp st(1) ;-> st(0)=(1+R)^N, st(1)=R
fld st ;-> st(0)=(1+R)^N, st(1)=(1+R)^N, st(2)=R
fld1 ;-> st(0)=1, st(1)=(1+R)^N, st(2)=(1+R)^N, st(3)=R
fsub ;-> st(0)=(1+R)^N-1, st(1)=(1+R)^N, st(2)=R
fdiv ;-> st(0)=(1+R)^N/[(1+R)^N-1], st(1)=R
fmul ;-> st(0)=R*(1+R)^N/[(1+R)^N-1]
fimul P ;-> st(0)=P*R*(1+R)^N/[(1+R)^N-1]=Monthly payments
fstp PMT
inc ebx
.ENDW
invoke GetTickCount
pop edx
sub eax, edx
print ustr$(eax),"ms",13,10
invoke crt_printf,chr$("%.15f%c%c"), PMT, 10, 10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
631ms
703.336709030826340
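In case the fyl2xp1/f2xm1/fscale sequence looks cryptic, it is the standard FPU route for raising to a power:
x^y = 2^(y*log2(x)), so here (1+R)^N = 2^(N*log2(1+R))
fyl2xp1 delivers N*log2(1+R) directly from the monthly rate R; because f2xm1 only accepts a fractional argument, the exponent is split with frndint into its integer and fractional parts, f2xm1 handles the fraction, and fscale folds the integer part back in as a power of two.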
Thanks a lot, Michael. The FpuLib has considerable overheads, probably because it is supposed to be safe rather than fast. I will have to study carefully the timings of certain alternatives, especially what is the tradeoff between speed and accuracy.
For comparison, I translated your code to C and used the VC++2003 compiler to optimize it. I had to work to get the compiler to optimize the calculation without optimizing the redundant calculations away, and the changes I made probably increased the execution time by some small amount. Without optimization a million iterations took ~1560ms, and with /O2 /G6 optimizations ~730ms.
Quote from: jj2007 on May 23, 2008, 07:37:38 PM
The FpuLib has considerable overheads, probably because it is supposed to be safe rather than fast.
That is correct. The primary purpose of that library is to assist programmers with floating point operations until they become themselves sufficiently familiar with FPU instructions. The most significant overhead is the saving of the register contents (to avoid/minimize any trashing and provide free registers to perform the required operation) and the subsequent restoration of the register contents.
By providing the source code for all the functions, a secondary purpose of the library is to provide some concrete code examples to perform some of the more difficult computations (such as x^y).
Raymond
Quote from: raymond on May 24, 2008, 04:11:54 AM
The primary purpose of that library is to assist programmers with floating point operations until they become themselves sufficiently familiar with FPU instructions.
The FpuLib help file is indeed an excellent tutorial.
Quote
some concrete code examples to perform some of the more difficult computations (such as x^y).
If I ever come out with my own routines, your contribution will be duly acknowledged, Raymond :thumbu
Quote
The FpuLib help file is indeed an excellent tutorial.
Just in case you are not already aware of it, there's also the FPU tutorial at:
http://www.ray.masmcode.com/fpu.html
Raymond, you may remember that I asked you if there is a way to do an approximate float comparison. What I got may be a bit slow, but it works - see attached exe. I cannot give the full source yet, as it is not in a presentable state, but I can show how it is being called, and the FPU bit that does the approximate comparison.
Usage
rcPrec EQU 9 ; constant with desired precision
.data
G1 dt 12.34
G2 dt 43.21
proc xxx
LOCAL f1:REAL10, f2:REAL10, FpuRes10:REAL10, tmp$ ; some bits missing here - just for illustration
mvr f1, 11.11 ; mvr = mov real
mvr f2, 22.22
mvr FpuRes10, 2256.8 ; the expected value
.if rc(4+5-G1*(f2-3.4)+G2*f1+2000, FpuRes10, rcPrec)==0
MulCat Tmp$, Tmp$, "** Operands are equal or almost equal at ", <STR$(rcPrec)>, " digits precision", crlf$
.elseif eax==-1
MulCat Tmp$, Tmp$, "++ Second operand is bigger", crlf$
inc ParseErrors
.elseif eax==1
MulCat Tmp$, Tmp$, "-- Second operand is smaller", crlf$
inc ParseErrors
.endif
MsgBox tmp$, "Approximate comparisons:", MB_OK
The rc macro
rc MACRO cmp1:REQ, cmp2:=<0>, prec:=<0> ; RealComp
LOCAL prec$, tmp
% echo comparing <cmp1> with <cmp2>
Let rcTmpVar1=<cmp1>
Let rcTmpVar2=<cmp2>
prec$ equ <0>
tmp=prec
% echo first prec$
if tmp gt 10
tmp=10
endif
if tmp gt 1
prec$ SUBSTR <1000000000000000>, 1, tmp+1
endif
% echo secondPrecision prec$
invoke RealComp, addr rcTmpVar1, addr rcTmpVar2, prec$
% echo rcdonewith prec$
EXITM <eax>
ENDM
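To make the expansion concrete: with rcPrec = 9, tmp is 9, the SUBSTR takes the first 10 characters of <1000000000000000>, and prec$ becomes 1000000000, i.e. 10^9. So (roughly - the Let lines depend on my not-yet-published macros) the .if in the usage example above boils down to:
Let rcTmpVar1=<4+5-G1*(f2-3.4)+G2*f1+2000>
Let rcTmpVar2=<FpuRes10>
invoke RealComp, addr rcTmpVar1, addr rcTmpVar2, 1000000000
so RealComp receives the desired precision already converted to a power of ten.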
The RealComp procedure
RealComp proc p1stOp:DWORD, p2ndOp:DWORD, Precision:DWORD
.if Precision
ffree ST(7) ; push 1 for later subtraction
fld1
.endif
ffree ST(7) ; push 1st op
mov eax, p1stOp
fld real10 ptr [eax]
ffree ST(7) ; push 2nd op
mov eax, p2ndOp
fld real10 ptr [eax]
.if Precision
ftst ; second op=ST zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc rnc_DIVIDE ; go dividing them if ST is nonzero
fincstp ; move ST1 to ST0
ftst ; was ST1 also zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc @F
xor eax, eax ; 0 signals ...
ret ; ... both zero
rnc_DIVIDE:
fdiv ; divide ST1 by ST0
fsub ST, ST(1) ; subtract 1, e.g. 1.01, 1.05 -> 1.01/1.05-1 = -0.038
ftst ; yeah, we must explicitly ask the FPU to set the flags
fstsw ax ; move FPU flags C1 etc to ax
bswap eax ; park the status word (and thus the sign) in the high bytes, out of reach of the next fstsw
ffree ST(7) ; free a slot, then push the precision factor (a power of ten)
push Precision
fild dword ptr [esp] ;; move it into ST(0)
add esp, 4 ;; correct the stack
fmul ; multiply the difference by the precision factor and pop, e.g. -0.038*10000
frndint ; round to integer
ftst ; result is zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc @F
xor eax, eax ; operands are almost equal
ret
@@: bswap eax ; restore the status word saved after testing the sign of the difference
bt eax, 8 ; C0 was set if the difference was negative, i.e. 1st op/2nd op - 1 < 0
jnc @F
xor eax, eax ; difference negative ...
dec eax ; ... return -1
ret
@@: xor eax, eax ; difference positive ...
inc eax ; ... return +1
ret
.else
fcompp ; pop twice
; status word bit positions: C3=14, C2=10, C1=9, C0=8
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3
jnc @F
xor eax, eax
ret
@@: bt eax, 8 ; C0
jnc @F
xor eax, eax
dec eax
ret
@@: xor eax, eax
inc eax
ret
.endif
RealComp endp
So what I basically do is take the two operands, divide one by the other, and subtract 1, e.g.
2256.8/2256.81-1=-0.000004431
For 5 digits of precision, I multiply by 100000, result -0.4431; frndint yields 0, so that is "approximately equal".
For 6 digits of precision, I multiply by 1000000, result -4.4; frndint yields -4, so that is "lower".
Simple and slow, but it works.
[attachment deleted by admin]
Minor suggestion. Look up the fcomi instruction if you have at least a PentiumIV.
If you use that as your first test after loading your two variables,
fcomi st,st(1)
jz isequal
pushfd ;the Carry Flag would indicate whether st<st(1) or not
pop somevar ;save that info for later
Then, after verifying they are not almost equal,
push somevar
popf
Jump to the appropriate message according to the CF.
Quote from: raymond on May 30, 2008, 03:36:50 AM
Minor suggestion. Look up the fcomi instruction if you have at least a PentiumIV.
If you use that as your first test after loading your two variables,
fcomi st,st(1)
jz isequal
pushfd ;the Carry Flag would indicate whether st<st(1) or not
pop somevar ;save that info for later
Then, after verifying they are not almost equal,
push somevar
popf
Jump to the appropriate message according to the CF.
Thanks, Raymond. That could indeed save quite a number of cycles, but my impression is that here in the forum some guys still stick to their trusty P3s ;-)
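For my own notes, here is roughly how I understand that would slot into RealComp (an untested sketch; SavedFlags is a hypothetical DWORD, the labels are made up, and the operands are loaded as in RealComp above, i.e. st(0)=2nd op, st(1)=1st op):
fcomi st, st(1) ; compares st(0) with st(1) and sets ZF/CF directly
jz exact_equal ; ZF=1: exactly equal, no need for the divide-and-round dance
pushfd ; CF=1 here means st(0) < st(1)
pop SavedFlags ; keep that for later
; ... the "almost equal" test as before; if it fails:
push SavedFlags
popfd ; 32-bit code, so popfd rather than popf
jc second_op_smaller ; CF set: the second operand is the smaller one
jmp second_op_bigger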
I attach a new test, correcting the order (which was inverted) and my iteration count (counting to 7 was too difficult for me last night). The apparent inaccuracy of the numbers ("Expected value is 410.23000xxx2") does not come from the FPU, which has exactly 410.23 in its registers, but rather from the invoke crt__gcvt etc. that I used to display the numbers as a string.
[attachment deleted by admin]
More mysteries: On my old notebook, boasting a P4 with 2.6 GHz, parser.exe (attached) runs in 530 milliseconds. Now I bought a new toy with a Celeron M at 1.6 GHz. Guess what? Parser runs in 125 ms, more than 4 times as fast. Is that a "known bug"? I could not find any reports on such dramatic FPU speed increases.
[attachment deleted by admin]
Perhaps during the test, or some portion of the test, the P4 is actually running at a lower clock speed than the Celeron. If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
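A minimal way to time in clock cycles rather than milliseconds (a rough sketch: StartLo/StartHi are hypothetical variables, the thread is assumed to stay on one core, and cpuid is only there to serialise - note it clobbers eax, ebx, ecx and edx):
.data?
StartLo dd ?
StartHi dd ?
.code
xor eax, eax
cpuid ; serialise before reading the counter
rdtsc ; edx:eax = time stamp counter
mov StartLo, eax
mov StartHi, edx
; ... code under test ...
xor eax, eax
cpuid
rdtsc
sub eax, StartLo
sbb edx, StartHi ; edx:eax = elapsed clock cycles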
Quote from: MichaelW on July 21, 2008, 11:15:28 PM
Perhaps during the test, or some portion of the test, the P4 is actually running at a lower clock speed than the Celeron. If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
Will test that asap, thanks. One strange behaviour of the old notebook (P4) is that the fan starts making noise when I run timed code for more than, say, three seconds; and then the speed actually decreases drastically. Apart from that, the timings are very stable.
I use an older P4-1.5GHz during part of the year. I also have a newer CoreDuo rated at 1.89GHz which I use during another part of the year. That newer one runs most of my apps between 4 and 7 times faster!!! It must have something to do with the internal architecture of the processor.
Quote from: MichaelW on July 21, 2008, 11:15:28 PM
If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
Do the Timer macros measure FPU cycles the same way as CPU cycles?
Just timed it on my office desktop, a P4 at 3.4 GHz: 400-420 ms. As Raymond also noticed, the Core Duo CPUs seem to be 4x faster. I wonder how that would affect the FPU vs SSE2 debate...?
As far as I know, for recent processors the CPU and FPU share the same clock. Now that I think about it I recall reading, during the time that Intel and AMD were actively competing for the fastest processor, that Intel had compromised the P4 FPU to the point that even with the higher clock speeds of the P4 the Athlon FPU was much faster.