Integer vs FPU square root
I have been working on some prime number generators recently, and I needed to know whether an integer square root routine could be significantly faster than the FPU square root. The attachment is a test app that compares the execution times of the FPU square root against a presumably fast integer square root routine that I found here:
http://www.azillionmonkeys.com/qed/sqroot.html#calcmeth
I made no attempt to (further) optimize the integer square root code, and almost no attempt to optimize the FPU code. There are two versions of the integer square root routine: one presumably optimized for the P4, and one for essentially everything else. There are also two versions of the FPU square root routine: one that saves the RC mode, sets it to truncate, and restores the original mode before returning, and one that simply accepts the current mode.
Running on a P3, with an arbitrary test value of 999, I get:
isqrt : 105 cycles
isqrt_p4 : 138 cycles
fisqrt : 124 cycles
fisqrt_not: 62 cycles
So considering that I can use the default RC mode, the best choice for running on a P3 is a no-brainer. I would like to know how the code does on other processors.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include timers.asm
isqrt PROTO :DWORD
isqrt_p4 PROTO :DWORD
fisqrt PROTO :DWORD
fisqrt_not PROTO :DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
xor ebx, ebx
.WHILE ebx <= 100
print ustr$(ebx),9
invoke isqrt, ebx
print ustr$(eax),9
invoke fisqrt, ebx
print ustr$(eax),13,10
inc ebx
.ENDW
print chr$(13,10)
inkey "Press any key to continue...",13,10
xor ebx, ebx
.WHILE ebx <= 100
print ustr$(ebx),9
invoke isqrt_p4, ebx
print ustr$(eax),9
invoke fisqrt_not, ebx
print ustr$(eax),13,10
inc ebx
.ENDW
print chr$(13,10)
inkey "Press any key to continue...",13,10
LOOP_COUNT EQU 1000000
print "isqrt : "
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke isqrt, 999
counter_end
print ustr$(eax)," cycles",13,10
print "isqrt_p4 : "
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke isqrt_p4, 999
counter_end
print ustr$(eax)," cycles",13,10
print "fisqrt : "
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke fisqrt, 999
counter_end
print ustr$(eax)," cycles",13,10
print "fisqrt_not: "
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke fisqrt_not, 999
counter_end
print ustr$(eax)," cycles",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; This is based on MSVC code by Norbert Juffa, from Paul Hsieh's
; Square Roots page:
;
; http://www.azillionmonkeys.com/qed/sqroot.html#calcmeth
;
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
iterasm MACRO n
lea edx, [eax+(1 SHL n)]
lea ebx, [eax+(2 SHL n)]
shl edx, n
mov esi, ecx
sub ecx, edx
cmovnc eax, ebx
cmovc ecx, esi
ENDM
iterasm_p4 MACRO n
mov edx, 1 SHL n
mov ebx, 2 SHL n
or edx, eax
or ebx, eax
shl edx, n
sub ecx, edx
sbb esi, esi
sub eax, ebx
and eax, esi
add eax, ebx
and edx, esi
add ecx, edx
ENDM
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
isqrt proc arg:DWORD
mov ecx, [esp+4] ; arg
push ebx ; save as per calling convention
push esi ; save as per calling convention
xor eax, eax ; 2*root
; iteration 15
mov ebx, 2 SHL 15
mov esi, ecx
sub ecx, 1 SHL 30
cmovnc eax, ebx
cmovc ecx, esi
iterasm 14
iterasm 13
iterasm 12
iterasm 11
iterasm 10
iterasm 9
iterasm 8
iterasm 7
iterasm 6
iterasm 5
iterasm 4
iterasm 3
iterasm 2
iterasm 1
; iteration 0
mov edx, 1
mov ebx, 2
add edx, eax
add ebx, eax
sub ecx, edx
cmovnc eax, ebx
shr eax, 1
mov esi, [esp] ; restore as per calling convention
mov ebx, [esp+4] ; restore as per calling convention
add esp, 8 ; remove the two saved registers
ret 4 ; pop one DWORD arg and return
isqrt endp
align 4
isqrt_p4 proc arg:DWORD
mov ecx, [esp+4] ; arg
push ebx ; save as per calling convention
push esi ; save as per calling convention
xor eax, eax ; 2*root
; iteration 15
mov ebx, 2 SHL 15
mov esi, ecx
sub ecx, 1 SHL 30
cmovnc eax, ebx
cmovc ecx, esi
iterasm_p4 14
iterasm_p4 13
iterasm_p4 12
iterasm_p4 11
iterasm_p4 10
iterasm_p4 9
iterasm_p4 8
iterasm_p4 7
iterasm_p4 6
iterasm_p4 5
iterasm_p4 4
iterasm_p4 3
iterasm_p4 2
iterasm_p4 1
; iteration 0
mov edx, 1
mov ebx, 2
add edx, eax
add ebx, eax
sub ecx, edx
cmovnc eax, ebx
shr eax, 1
mov esi, [esp] ; restore as per calling convention
mov ebx, [esp+4] ; restore as per calling convention
add esp, 8 ; remove the two saved registers
ret 4 ; pop one DWORD arg and return
isqrt_p4 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
oldcw dw 0 ; control word storage must be writable, so place it in .data
newcw dw 0
.code
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
fisqrt proc arg:DWORD
fstcw oldcw
fwait
mov ax, oldcw
or ax, 0C00h ; Set RC field for truncating mode
mov newcw, ax
fldcw newcw
fild DWORD PTR[esp+4]
fsqrt
push eax
fistp DWORD PTR[esp]
pop eax
fldcw oldcw
ret 4
fisqrt endp
align 4
fisqrt_not proc arg:DWORD
fild DWORD PTR[esp+4]
fsqrt
push eax
fistp DWORD PTR[esp]
pop eax
ret 4
fisqrt_not endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results for AMD Athlon64 3000+ (WinXP Pro SP2):
isqrt : 59 cycles
isqrt_p4 : 87 cycles
fisqrt : 68 cycles
fisqrt_not: 25 cycles
AthlonXP (Barton) 2600+ WinXP home
isqrt : 75 cycles
isqrt_p4 : 91 cycles
fisqrt : 68 cycles
fisqrt_not: 27 cycles
From the AMD Athlon optimization guide:
Quote
3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.
Optimized 15-Bit Precision Square Root
This square root operation can be executed in only seven cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four cycles less than the square root operation.
Example 1:
MOVD MM0, [MEM] ; 0 | a
PFRSQRT MM1, MM0 ; 1/sqrt(a) | 1/sqrt(a) (approximate)
PUNPCKLDQ MM0, MM0 ; a | a (MMX instr.)
PFMUL MM0, MM1 ; sqrt(a) | sqrt(a)
Optimized 24-Bit Precision Square Root
This square root operation can be executed in only 19 cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four cycles less than the square root operation.
Example 2:
MOVD MM0, [MEM] ; 0 | a
PFRSQRT MM1, MM0 ; 1/sqrt(a) | 1/sqrt(a) (approx.)
MOVQ MM2, MM1 ; X_0 = 1/(sqrt a) (approx.)
PFMUL MM1, MM1 ; X_0 * X_0 | X_0 * X_0 (step 1)
PUNPCKLDQ MM0, MM0 ; a | a (MMX instr)
PFRSQIT1 MM1, MM0 ; (intermediate) (step 2)
PFRCPIT2 MM1, MM2 ; 1/sqrt(a) | 1/sqrt(a) (step 3)
PFMUL MM0, MM1 ; sqrt(a) | sqrt(a)
From the page I linked:
Quote
Of course, with the advent of SIMD instruction sets in modern x86 CPUs, some of these approximations have been incorporated into the instruction set itself. The following is an example that computes two parallel inverse square roots using AMD's 3DNow! instruction set which was introduced in 1997:
.586
.MMX
.K3D
MOVQ MM0, [mem] ; b | a
PFRSQRT MM1, MM0 ; 1/sqrt(a) | 1/sqrt(a) (12 bit approx)
MOVQ MM2, MM0 ;
PUNPCKHDQ MM0, MM0 ; b | b
PFRSQRT MM0, MM0 ; 1/sqrt(b) | 1/sqrt(b) (12 bit approx)
PUNPCKLDQ MM1, MM0 ; 1/sqrt(b) | 1/sqrt(a) (12 bit approx)
MOVQ MM0, MM1 ;
PFMUL MM1, MM1 ; ~1/b | ~1/a (12 bit approx)
PFRSQIT1 MM1, MM2 ; ??? | ???
PFRCPIT2 MM1, MM0 ; 1/sqrt(b) | 1/sqrt(a) (23 bit approx)
Quote
Which, of course, will give the highest throughput of any of the code shown here, but is tied to a particular processor architecture.
Does the Athlon64/Opteron support 3DNow! ?
All AMD processors from the K6-2 onward support 3DNow!, including the Athlon 64.
SSE has a reciprocal square root too (RSQRTSS/RSQRTPS), and Intel processors have SSE.
As for the integer square root, I guess the P4 version will run slower than the general version even on a P4...
Pentium 4 (2GHz) XP Home
isqrt : 236 cycles
isqrt_p4 : 281 cycles
fisqrt : 51 cycles
fisqrt_not: 38 cycles
Thanks for testing, guys. The linked page did not actually state that the version I labeled 'P4' would be faster on a P4. I was assuming that the author of the isqrt code had tested it on a P4 and included the 'generic 386+' version for the P4, but I see now that I was misinterpreting this:
Quote
(Note that although other x86 processors can execute cmovCC instructions, they do not yield improvements in performance over the code given except on Athlon based processors.) This solution significantly outperforms the first method. Norbert wrote in to inform me that cmovCC, sbb and setCC instructions are intolerably slow on the P4 platform. However:
... the fastest ISQRT() algorithm by far is to go through the FPU. Even with the change in FP control word it's quite fast. It looks like the P4 has an optimization for FLDCW that's better than what is in the Athlon. Since most applications use just two different CW values I speculate that they might be "caching" the last two CWs seen internally, i.e. this is like renaming limited to 2 rename registers.
So it looks like for my purposes an FPU version would be best, for any recent processor.
Setting and restoring the FPU state kills performance, yet we need correct truncated results. This can be done by simply subtracting 0.5 from ST(0) just before the fistp. *
fk_half real4 0.5
usqrt proc arg
fild arg
fsqrt
fsub fk_half
fistp arg
mov eax,arg
ret
usqrt endp
AthlonXP 2000+ : 24 cycles
*provided that we don't tinker with the FPU control word
Well, from what we can read of tests between AMD64 and Intel's 64-bit parts, Intel only has the faster FPU; AMD64 has faster integer and SSE/SSE2/SSE3.
And compare the code simplicity:
FPU: clumsy stack-based register model, while MMX and later work directly on MMX/XMM registers
FPU: 80-bit internal operations, but the data is usually given as only 64-bit
XMM: 128-bit float data, given as 128-bit
So choose what you want. Even today, when games rotate objects they use MMX/SSE, because one instruction can rotate 1 to 8 points at once (1x128, 2x64, 4x32, or 8x16-bit precision).
Sorry, but the FPU is dead; I don't see the point in using it, except maybe when someone has a 486DX and we want compatibility. Also, try searching for primes among 6 million numbers with 64-bit versus 128-bit operations: two or even four times fewer operations. Why four? Because a 32x32-bit multiply
is one instruction, but a 64x64-bit multiply needs at least 4-5 instructions.
The FPU is definitely not dead. At least for me, on an AthlonXP it matches or often outperforms SSE on complex DSP tasks. The most recent benchmark I made comparing them was 6-point spline interpolation of mono audio: FPU: 123 cycles/sample, SSE: ~180 cycles/sample. But I am not as good at SSE optimization as at FPU..
The only place I've seen SSE outperform the FPU is in simple tasks like Audio = AudioIn1 + AudioIn2. That runs on stereo audio at 1 cycle/sample with SSE vs. 2 cycles/sample with the FPU.
Remember the thing about Win64 not supporting FPU and MMX? It's not true. Agner Fog talks about it here:
Calling Conventions - Agner Fog (http://www.agner.org/assem/calling_conventions.pdf)
Quote
6.1 Can floating point registers be used in 64-bit Windows?
There is widespread confusion about whether 64-bit Windows allows the
use of the floating point registers ST(0)-ST(7) and the MM0 - MM7
registers that are aliased upon these. One early technical document
found at Microsoft's website says "x87/MMX registers are unavailable to
Native Windows64 applications" (Rich Brunner: Technical Details Of
Microsoft® Windows® For The AMD64 Platform, Dec. 2003). An AMD
document says: "64-bit Microsoft Windows does not strongly support MMX
and 3Dnow! instruction sets in the 64-bit native mode" (Porting and
Optimizing Multimedia Codecs for AMD64 architecture on Microsoft®
Windows®, July 21, 2004). A document in Microsoft's MSDN says: "A
caller must also handle the following issues when calling a callee:
[...] Legacy Floating-Point Support: The MMX and floating-point stack
registers (MM0-MM7/ST0-ST7) are volatile. That is, these legacy
floating-point stack registers do not have their state preserved across
context switches" (MSDN: Kernel-Mode Driver Architecture: Windows DDK:
Other Calling Convention Process Issues. Preliminary, June 14, 2004;
February 18, 2005).
This description is nonsense because it confuses saving registers
across function calls and saving registers across context switches.
Some versions of the Microsoft assembler ml64 (e.g. v. 8.00.40310)
gives the following message when attempts are made to use floating
point registers in 64 bit mode: "error A2222: x87 and MMX instructions
disallowed; legacy FP state not saved in Win64".
However, a public discussion forum quotes the following answers from
Microsoft engineers regarding this issue: "From: Program Manager in
Visual C++ Group, Sent: Thursday, May 26, 2005 10:38 AM. It does
preserve the state. It's the DDK page that has stale information, which
I've requested it to be changed. Let them know that the OS does
preserve state of x87 and MMX registers on context switches." and
"From: Software Engineer in Windows Kernel Group, Sent: Thursday, May
26, 2005 11:06 AM. For user threads the state of legacy floating point
is preserved at context switch. But it is not true for kernel threads.
Kernel mode drivers can not use legacy floating point instructions."
(http://www.planetamd64.com/index.php?showtopic=3458&st=100).
The issue has finally been resolved with the long overdue publication
of a more detailed ABI for x64 Windows in the form of a document
entitled "x64 Software Conventions", well hidden in the bin directory
(not the help directory) of some compiler packages. This document says:
"The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are
preserved across context switches. There is no explicit calling
convention for these registers. The use of these registers is strictly
prohibited in kernel mode code." My tests indicate that these registers
are saved correctly during task switches and thread switches in 64-bit
mode, even in an early beta version of x64 Windows. Furthermore, I see
no reason to not save these registers. If the floating point registers
were not saved during a task switch then they would have to be cleared
for security reasons; and the time required for clearing these
registers would be no less than the time required for saving these
registers. The floating point registers must be supported when running
legacy 32-bit programs as well. These considerations do not apply to
kernel mode where information security is less of an issue. According
to the above information, the operating system may not save the
floating point registers across kernel mode context switches.
The Microsoft C++ compiler version 14.0 never uses these registers in
64-bit mode, and doesn't support long double precision. The Intel C++
compiler for x64 Windows supports long double precision and __m64 in
version 9.0 and later, while earlier versions do not.
The conclusion is that it is safe to use floating point registers and
MMX registers in 64-bit Windows, except in kernel mode drivers.
The FPU and MMX are definitely not dead! :U
Quote
fpu 80bit operations, data given only 64bit
xmm 128bit float data given 128bit
Ah yes, "precision". This is another major plus of the FPU :) . Remember, SSE was made with one thing in mind: 3D games. 32-bit floats are ideal for games, but many other areas of software need higher precision ^^ .
I'm not saying "ignore SSE" or "ignore FPU" - just that the coder should make the same FP-using proc in two versions, and compare their performance extensively. It's a relief that we can use the two sets at once :toothy
The FPU still has its place in 3D.
Normalizing vectors in SIMD is faster, but sometimes you get a vector whose length is not exactly 1.0, which can lead to accumulated errors.
So it is a tradeoff between speed and accuracy.