I am trying to develop some speedy FPU macros. Out of curiosity, I benchmarked them against a trusty old 16-bit BASIC; here is my test suite:
CrLf$=CHR$(13, 10), gRes=0
G1=12.34, G2=543.21, f1=11.11
Ti%=GetTickCount(), L%=1000000
FOR ebx=0 TO L%
gRes=gRes+5.5-G2*(f1+380.78*G1)+G2*f1+ebx*1.44444
gRes=gRes+5.5-G2*(f1+(34.5+G1))+G2*f1+ebx*1.503
gRes=gRes+5.5-G2*(f1+(3456.543/G1))+G2*f1+ebx*1.5
gRes=gRes+5.5-G2*(f1+34.5*G1)-G2*f1+ebx*1.5
NEXT ebx
Ti%=GetTickCount()-Ti%
~MessageBox(0,"It took me a while for"+CrLf$+STR$(L%)+" loops: "+STR$(Ti%)+" ms", "Benchmark:", MB_OK)
Now these 1 million loops take slightly less than 2 seconds. The same in assembler: slightly less than 2 seconds. So far, so good - it seems 16-bit code already knew how to access the FPU.
What I forgot to mention, though, is that these 2 seconds refer to interpreted code; the same code compiled runs in about 200 ms... ten times faster than my pretty slim assembler routines. Here is a snippet of my macro output - it could be optimised here and there (see the short sketch right after the listing), but a factor of 10???
9B wait
DBE3 finit
6A 04 push 4
DB0424 fild dword ptr [esp]
58 pop eax
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
6A 05 push 5
DB0424 fild dword ptr [esp]
58 pop eax
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
DB2D 5E204000 fld tbyte ptr [40205E]
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
DB6D CC fld tbyte ptr [ebp-34]
DDD4 fst st(4)
DDC0 ffree st
D9F7 fincstp
6A 22 push 22
DB0424 fild dword ptr [esp]
58 pop eax
DDD5 fst st(5)
DDC0 ffree st
D9F7 fincstp
6A 03 push 3
DB0424 fild dword ptr [esp]
58 pop eax
D8FD fdivr st, st(5)
DDD5 fst st(5)
DDC0 ffree st
D9F7 fincstp
D9C4 fld st(4)
D8C4 fadd st, st(4)
DDD4 fst st(4)
DDC0 ffree st
D9F7 fincstp
D9C3 fld st(3)
D8CB fmul st, st(3)
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
D9C2 fld st(2)
D8EA fsubr st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
DB2D 5E204000 fld tbyte ptr [40205E]
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
DB6D CC fld tbyte ptr [ebp-34]
D8CB fmul st, st(3)
DDD3 fst st(3)
DDC0 ffree st
D9F7 fincstp
D9C2 fld st(2)
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
68 D0070000 push 7D0
DB0424 fild dword ptr [esp]
58 pop eax
D8C2 fadd st, st(2)
DDD2 fst st(2)
DDC0 ffree st
D9F7 fincstp
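Just to illustrate what I mean by "optimised here and there": every operand above goes through push/fild/pop plus fst/ffree/fincstp. A compact hand-written version of just the 4+5 part at the top could look like this (a rough sketch; four and five are hypothetical dword variables kept in .data instead of being pushed):
.data
four dd 4
five dd 5
.code
fild four ; st(0) = 4
fiadd five ; st(0) = 9 - one memory-operand instruction instead of push/fild/pop/fst/ffree/fincstp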
Trying to solve the mystery, I disassembled the BASIC binary and found code like this:
DD06B40D fld qword [0xdb4]
DC064000 fadd qword [0x40]
DC0EAC0D fmul qword [0xdac]
DC2E3000 fsubr qword [0x30]
DD06AC0D fld qword [0xdac]
DC0EB40D fmul qword [0xdb4]
DEC1 faddp st1
DC063800 fadd qword [0x38]
83EC08 sub sp,byte +0x8
DD5EEC fstp qword [bp-0x14]
Now that doesn't look dramatically different from my own snippets, so I am still chasing the factor of 10.
I am a bloody beginner in this, so I hope somebody with more experience can give me a hint on how to solve the mystery...
If I take the snippet you posted and substitute a few variables to make it assemble, I get a cycle count of 130 (or 105 if I remove the finit, or 103 if I also remove the leading wait). Hopefully someone will correct me if I'm wrong here, but I doubt that this is due to overlap, where the CPU is effectively reading the cycle count before the FPU finishes, because placing a wait at the end of the FPU code does not change the count. I have not timed much FPU code, but for 72 FPU instructions 130 cycles seems at least reasonable to me. And assuming that the code is most or all of your loop code, on my relatively slow processor 1000000 loops would account for only about 260ms. Are you sure that your timings are accurate and that this code is the bottleneck?
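To spell out the arithmetic behind that estimate (assuming the ~130 cycles per iteration measured above covers the whole loop body):
130 cycles/iteration * 1,000,000 iterations = 130,000,000 cycles
130,000,000 cycles / 500,000,000 cycles per second ≈ 260 ms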
Sure there is room for some improvement, Michael; but here is another test:
I guess you all know the FpuLib and Raymond's example. I coded it in Masm (see Fpu help) and Basic:
L%=1000000
P=92500
R=6.75
N=20
FOR ebx=0 TO L%
PMT=P*R/1200*(1+R/1200)^(N*12)/((1+R/1200)^(N*12)-1)
NEXT ebx
That runs in slightly less than a second. Raymond's example in 32-bit assembler takes 6.3 seconds... a factor of 7. Does that make sense?
Here is the complete Basic code - the relevant bits are delimited by GetTickCount, so it should be possible to find them. I attach the exe and the disassembly, plus NDISASM in case you don't have it. Below the code there is also a quick by-hand check of the expected value.
P=92500
R=6.75
N=20
Ti%=GetTickCount()
FOR ebx=0 TO 1000000
PMT=P*R/1200*(1+R/1200)^(N*12)/((1+R/1200)^(N*12)-1)
NEXT ebx
Ti2%=GetTickCount()
Ti%=Ti2%-Ti%
~MessageBox(0,"It took "+STR$(Ti%)+" ms for "+CHR$(13,10)+"1000000 loops to find"+CHR$(13,10)+"the number "+STR$(PMT), "Benchmark:", MB_OK)
[attachment deleted by admin]
Quote
That runs in slightly less than a second. Raymond's example in 32-bit assembler takes 6.3 seconds... a factor of 7. Does that make sense?
No, it doesn't make sense. If I extract the essential parts of Raymond's example and modify the code somewhat so that it produces the same result as your code and does the rate and month calculations that are in your code, then on my 500 MHz P3 a million iterations takes ~630ms. The EXE from your attachment takes ~2200ms, and a FreeBASIC version of your code takes ~1900ms.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
PMT dq 0
R dq 6.75
P dd 92500
N dd 20
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke GetTickCount
push eax
xor ebx, ebx
.WHILE ebx < 1000000
fld R ; st(0)=yearly rate
fld8 1200.0 ; st(1)=1200, st(0)=yearly rate
fdiv ; st(0)=monthly rate
fild N ; st(0)=years, st(1)=monthly rate
fld8 12.0 ; st(0)=12, st(1)=years, st(2)=monthly rate
fmul ; st(0)=months, st(1)=monthly rate
fld st(1) ;-> st(0)=R, st(1)=months, st(2)=R
fyl2xp1 ;-> st(0)=log2(1+R)*months, st(1)=R
fld st ;-> st(0)=log, st(1)=log, st(2)=R
frndint ;-> st(0)=int(log), st(1)=log, st(2)=R
fsub st(1),st ;-> st(0)=int(log), st(1)=log-int(log), st(2)=R
fxch st(1) ;-> st(0)=log-int(log), st(1)=int(log), st(2)=R
f2xm1 ;-> st(0)=2^[log-int(log)]-1, st(1)=int(log), st(2)=R
fld1
fadd ;-> st(0)=2^[log-int(log)], st(1)=int(log), st(2)=R
fscale ;-> st(0)=(1+R)^N, st(1)=int(log), st(2)=R
fstp st(1) ;-> st(0)=(1+R)^N, st(1)=R
fld st ;-> st(0)=(1+R)^N, st(1)=(1+R)^N, st(2)=R
fld1 ;-> st(0)=1, st(1)=(1+R)^N, st(2)=(1+R)^N, st(3)=R
fsub ;-> st(0)=(1+R)^N-1, st(1)=(1+R)^N, st(2)=R
fdiv ;-> st(0)=(1+R)^N/[(1+R)^N-1], st(1)=R
fmul ;-> st(0)=R*(1+R)^N/[(1+R)^N-1]
fimul P ;-> st(0)=P*R*(1+R)^N/[(1+R)^N-1]=Monthly payments
fstp PMT
inc ebx
.ENDW
invoke GetTickCount
pop edx
sub eax, edx
print ustr$(eax),"ms",13,10
invoke crt_printf,chr$("%.15f%c%c"), PMT, 10, 10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
631ms
703.336709030826340
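In case the fyl2xp1/f2xm1/fscale sequence looks cryptic, it is the standard FPU route for raising to a power:
x^y = 2^(y*log2(x)), so here (1+R)^N = 2^(N*log2(1+R))
fyl2xp1 delivers N*log2(1+R) directly from the monthly rate R; because f2xm1 only accepts a fractional argument, the exponent is split with frndint into its integer and fractional parts, f2xm1 handles the fraction, and fscale folds the integer part back in as a power of two.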
Thanks a lot, Michael. The FpuLib has considerable overheads, probably because it is supposed to be safe rather than fast. I will have to study carefully the timings of certain alternatives, especially what is the tradeoff between speed and accuracy.
For comparison, I translated your code to C and used the VC++2003 compiler to optimize it. I had to work to get the compiler to optimize the calculation without optimizing the redundant calculations away, and the changes I made probably increased the execution time by some small amount. Without optimization a million iterations took ~1560ms, and with /O2 /G6 optimizations ~730ms.
Quote from: jj2007 on May 23, 2008, 07:37:38 PM
The FpuLib has considerable overheads, probably because it is supposed to be safe rather than fast.
That is correct. The primary purpose of that library is to assist programmers with floating point operations until they become themselves sufficiently familiar with FPU instructions. The most significant overhead is the saving of the register contents (to avoid/minimize any trashing and provide free registers to perform the required operation) and the subsequent restoration of the register contents.
By providing the source code for all the functions, a secondary purpose of the library is to provide some concrete code examples to perform some of the more difficult computations (such as x^y).
Raymond
Quote from: raymond on May 24, 2008, 04:11:54 AM
The primary purpose of that library is to assist programmers with floating point operations until they become themselves sufficiently familiar with FPU instructions.
The FpuLib help file is indeed an excellent tutorial.
Quote
some concrete code examples to perform some of the more difficult computations (such as x^y).
If I ever come out with my own routines, your contribution will be duly acknowledged, Raymond :thumbu
Quote
The FpuLib help file is indeed an excellent tutorial.
Just in case you are not already aware of it, there's also the FPU tutorial at:
http://www.ray.masmcode.com/fpu.html
Raymond, you may remember that I asked you if there is a way to do an approximate float comparison. What I got may be a bit slow, but it works - see attached exe. I cannot give the full source yet, as it is not in a presentable state, but I can show how it is being called, and the FPU bit that does the approximate comparison.
Usage
rcPrec EQU 9 ; constant with desired precision
.data
G1 dt 12.34
G2 dt 43.21
proc xxx
LOCAL f1:REAL10, f2:REAL10, FpuRes10:REAL10, tmp$ ; some bits missing here - just for illustration
mvr f1, 11.11 ; mvr = mov real
mvr f2, 22.22
mvr FpuRes10, 2256.8 ; the expected value
.if rc(4+5-G1*(f2-3.4)+G2*f1+2000, FpuRes10, rcPrec)==0
MulCat Tmp$, Tmp$, "** Operands are equal or almost equal at ", <STR$(rcPrec)>, " digits precision", crlf$
.elseif eax==-1
MulCat Tmp$, Tmp$, "++ Second operand is bigger", crlf$
inc ParseErrors
.elseif eax==1
MulCat Tmp$, Tmp$, "-- Second operand is smaller", crlf$
inc ParseErrors
.endif
MsgBox tmp$, "Approximate comparisons:", MB_OK
The rc macro
rc MACRO cmp1:REQ, cmp2:=<0>, prec:=<0> ; RealComp
LOCAL prec$, tmp
% echo comparing <cmp1> with <cmp2>
Let rcTmpVar1=<cmp1>
Let rcTmpVar2=<cmp2>
prec$ equ <0>
tmp=prec
% echo first prec$
if tmp gt 10
tmp=10
endif
if tmp gt 1
prec$ SUBSTR <1000000000000000>, 1, tmp+1
endif
% echo secondPrecision prec$
invoke RealComp, addr rcTmpVar1, addr rcTmpVar2, prec$
% echo rcdonewith prec$
EXITM <eax>
ENDM
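To make the expansion concrete: with rcPrec = 9, tmp is 9, the SUBSTR takes the first 10 characters of <1000000000000000>, and prec$ becomes 1000000000, i.e. 10^9. So (roughly - the Let lines depend on my not-yet-published macros) the .if in the usage example above boils down to:
Let rcTmpVar1=<4+5-G1*(f2-3.4)+G2*f1+2000>
Let rcTmpVar2=<FpuRes10>
invoke RealComp, addr rcTmpVar1, addr rcTmpVar2, 1000000000
so RealComp receives the desired precision already converted to a power of ten.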
The RealComp procedure
RealComp proc p1stOp:DWORD, p2ndOp:DWORD, Precision:DWORD
.if Precision
ffree ST(7) ; push 1 for later subtraction
fld1
.endif
ffree ST(7) ; push 1st op
mov eax, p1stOp
fld real10 ptr [eax]
ffree ST(7) ; push 2nd op
mov eax, p2ndOp
fld real10 ptr [eax]
.if Precision
ftst ; second op=ST zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc rnc_DIVIDE ; go dividing them if ST is nonzero
fincstp ; move ST1 to ST0
ftst ; was ST1 also zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc @F
xor eax, eax ; 0 signals ...
ret ; ... both zero
rnc_DIVIDE:
fdiv ; divide ST1 by ST0
fsub ST, ST(1) ; subtract 1, e.g. 1.01, 1.05 -> 1.01/1.05-1 = -0.038
ftst ; yeah, we must explicitly ask the FPU to set the flags
fstsw ax ; move FPU flags C1 etc to ax
bswap eax ; park the status word (and thus the sign) in the high bytes, out of reach of the next fstsw
ffree ST(7) ; free a slot, then push the precision factor (a power of ten)
push Precision
fild dword ptr [esp] ;; move it into ST(0)
add esp, 4 ;; correct the stack
fmul ; multiply the difference by the precision factor and pop, e.g. -0.038*10000
frndint ; round to integer
ftst ; result is zero?
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3 is set if ST=0
jnc @F
xor eax, eax ; operands are almost equal
ret
@@: bswap eax ; restore the status word saved after testing the sign of the difference
bt eax, 8 ; C0 was set if the difference was negative, i.e. 1st op/2nd op - 1 < 0
jnc @F
xor eax, eax ; difference negative ...
dec eax ; ... return -1
ret
@@: xor eax, eax ; difference positive ...
inc eax ; ... return +1
ret
.else
fcompp ; pop twice
; status word bit positions: C3=14, C2=10, C1=9, C0=8
fstsw ax ; move FPU flags C1 etc to ax
bt eax, 14 ; C3
jnc @F
xor eax, eax
ret
@@: bt eax, 8 ; C0
jnc @F
xor eax, eax
dec eax
ret
@@: xor eax, eax
inc eax
ret
.endif
RealComp endp
So what I basically do is take the two operands, divide one by the other, and subtract 1, e.g.
2256.8/2256.81-1=-0.000004431
For 5 digits of precision, I multiply by 100000, result -0.4431; frndint yields 0, so that is "approximately equal".
For 6 digits of precision, I multiply by 1000000, result -4.4; frndint yields -4, so that is "lower".
Simple and slow, but it works.
[attachment deleted by admin]
Minor suggestion. Look up the fcomi instruction if you have at least a PentiumIV.
If you use that as your first test after loading your two variables,
fcomi st,st(1)
jz isequal
pushfd ;the Carry Flag would indicate whether st<st(1) or not
pop somevar ;save that info for later
Then, after verifying they are not almost equal,
push somevar
popf
Jump to the appropriate message according to the CF.
Quote from: raymond on May 30, 2008, 03:36:50 AM
Minor suggestion. Look up the fcomi instruction if you have at least a PentiumIV.
If you use that as your first test after loading your two variables,
fcomi st,st(1)
jz isequal
pushfd ;the Carry Flag would indicate whether st<st(1) or not
pop somevar ;save that info for later
Then, after verifying they are not almost equal,
push somevar
popf
Jump to the appropriate message according to the CF.
Thanks, Raymond. That could indeed save quite a number of cycles, but my impression is that here in the forum some guys still stick to their trusty P3s ;-)
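For my own notes, here is roughly how I understand that would slot into RealComp (an untested sketch; SavedFlags is a hypothetical DWORD, the labels are made up, and the operands are loaded as in RealComp above, i.e. st(0)=2nd op, st(1)=1st op):
fcomi st, st(1) ; compares st(0) with st(1) and sets ZF/CF directly
jz exact_equal ; ZF=1: exactly equal, no need for the divide-and-round dance
pushfd ; CF=1 here means st(0) < st(1)
pop SavedFlags ; keep that for later
; ... the "almost equal" test as before; if it fails:
push SavedFlags
popfd ; 32-bit code, so popfd rather than popf
jc second_op_smaller ; CF set: the second operand is the smaller one
jmp second_op_bigger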
I attach a new test, correcting the order (which was inverted) and my iteration count (counting to 7 was too difficult for me last night). The apparent inaccuracy of the numbers ("Expected value is 410.23000xxx2") does not come from the FPU, which has exactly 410.23 in its registers, but rather from the invoke crt__gcvt etc. that I used to display the numbers as a string.
[attachment deleted by admin]
More mysteries: On my old notebook, boasting a P4 with 2.6 GHz, parser.exe (attached) runs in 530 milliseconds. Now I bought a new toy with a Celeron M at 1.6 GHz. Guess what? Parser runs in 125 ms, more than 4 times as fast. Is that a "known bug"? I could not find any reports on such dramatic FPU speed increases.
[attachment deleted by admin]
Perhaps during the test, or some portion of the test, the P4 is actually running at a lower clock speed than the Celeron. If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
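A minimal way to time in clock cycles rather than milliseconds (a rough sketch: StartLo/StartHi are hypothetical variables, the thread is assumed to stay on one core, and cpuid is only there to serialise - note it clobbers eax, ebx, ecx and edx):
.data?
StartLo dd ?
StartHi dd ?
.code
xor eax, eax
cpuid ; serialise before reading the counter
rdtsc ; edx:eax = time stamp counter
mov StartLo, eax
mov StartHi, edx
; ... code under test ...
xor eax, eax
cpuid
rdtsc
sub eax, StartLo
sbb edx, StartHi ; edx:eax = elapsed clock cycles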
Quote from: MichaelW on July 21, 2008, 11:15:28 PM
Perhaps during the test, or some portion of the test, the P4 is actually running at a lower clock speed than the Celeron. If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
Will test that asap, thanks. One strange behaviour of the old notebook (P4) is that the fan starts making noise when I run timed code for more than, say, three seconds; and then the speed actually decreases drastically. Apart from that, the timings are very stable.
I use an older P4-1.5GHz during part of the year. I also have a newer CoreDuo rated at 1.89GHz which I use during another part of the year. That newer one runs most of my apps between 4 and 7 times faster!!! It must have something to do with the internal architecture of the processor.
Quote from: MichaelW on July 21, 2008, 11:15:28 PM
If you were timing the code in clock cycles, instead of seconds, the results would not vary significantly with the clock speed.
Do the Timer macros measure FPU cycles the same way as CPU cycles?
Just timed it on my office desktop, a P4 at 3.4 GHz: 400-420 ms. As Raymond also noticed, the Core Duo CPUs seem to be 4x faster. I wonder how that would affect the FPU vs SSE2 debate...?
As far as I know, for recent processors the CPU and FPU share the same clock. Now that I think about it I recall reading, during the time that Intel and AMD were actively competing for the fastest processor, that Intel had compromised the P4 FPU to the point that even with the higher clock speeds of the P4 the Athlon FPU was much faster.