News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Integer vs FPU square root

Started by MichaelW, March 19, 2006, 09:19:45 AM

Previous topic - Next topic

MichaelW

Integer vs FPU square root

I have been working on some prime number generators recently and I needed to know if an integer square root routine could be significantly faster than the FPU square root. The attachment is a test app that compares the execution times for the FPU square root and a presumably fast integer square root routine that I found here:

http://www.azillionmonkeys.com/qed/sqroot.html#calcmeth

I made no attempt to (further) optimize the integer square root code and almost no attempt to optimize the FPU code. There are two versions of the integer square root code, one that is presumably optimized for the P4, and one for essentially everything else. There are two versions of the FPU square root code, one that saves the RC mode, sets it to truncate, and then restores the original mode before returning, and one that just accepts the original mode.

Running on a P3, with an arbitrary test value of 999, I get:

isqrt     : 105 cycles
isqrt_p4  : 138 cycles
fisqrt    : 124 cycles
fisqrt_not: 62 cycles


So considering that I can use the default RC mode, the best choice for running on a P3 is a no-brainer. I would like to know how the code does on other processors.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include timers.asm

    isqrt       PROTO :DWORD
    isqrt_p4    PROTO :DWORD

    fisqrt      PROTO :DWORD
    fisqrt_not  PROTO :DWORD

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    xor   ebx, ebx
    .WHILE ebx <= 100
      print ustr$(ebx),9
      invoke isqrt, ebx
      print ustr$(eax),9
      invoke fisqrt, ebx
      print ustr$(eax),13,10
      inc   ebx
    .ENDW
    print chr$(13,10)

    inkey "Press any key to continue...",13,10

    xor   ebx, ebx
    .WHILE ebx <= 100
      print ustr$(ebx),9
      invoke isqrt_p4, ebx
      print ustr$(eax),9
      invoke fisqrt_not, ebx
      print ustr$(eax),13,10
      inc   ebx
    .ENDW
    print chr$(13,10)

    inkey "Press any key to continue...",13,10

    LOOP_COUNT EQU 1000000

    print "isqrt     : "
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke isqrt, 999
    counter_end
    print ustr$(eax)," cycles",13,10

    print "isqrt_p4  : "
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke isqrt_p4, 999
    counter_end
    print ustr$(eax)," cycles",13,10

    print "fisqrt    : "
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke fisqrt, 999
    counter_end
    print ustr$(eax)," cycles",13,10

    print "fisqrt_not: "
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke fisqrt_not, 999
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; This is based on MSVC code by Norbert Juffa, from Paul Hsieh's
; Square Roots page:
;
;   http://www.azillionmonkeys.com/qed/sqroot.html#calcmeth
;
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

iterasm MACRO n
    lea   edx, [eax+(1 SHL n)]
    lea   ebx, [eax+(2 SHL n)]
    shl   edx, n
    mov   esi, ecx
    sub   ecx, edx
    cmovnc eax, ebx
    cmovc ecx, esi
ENDM

iterasm_p4 MACRO n
    mov   edx, 1 SHL n
    mov   ebx, 2 SHL n
    or    edx, eax
    or    ebx, eax
    shl   edx, n
    sub   ecx, edx
    sbb   esi, esi
    sub   eax, ebx
    and   eax, esi
    add   eax, ebx
    and   edx, esi
    add   ecx, edx
ENDM

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

isqrt proc arg:DWORD
    mov   ecx, [esp+4]          ; arg
    push  ebx                   ; save as per calling convention
    push  esi                   ; save as per calling convention
    xor   eax, eax              ; 2*root
    ; iteration 15
    mov   ebx, 2 SHL 15
    mov   esi, ecx
    sub   ecx, 1 SHL 30
    cmovnc eax, ebx
    cmovc ecx, esi
    iterasm 14
    iterasm 13
    iterasm 12
    iterasm 11
    iterasm 10
    iterasm 9
    iterasm 8
    iterasm 7
    iterasm 6
    iterasm 5
    iterasm 4
    iterasm 3
    iterasm 2
    iterasm 1
    ; iteration 0
    mov   edx, 1
    mov   ebx, 2
    add   edx, eax
    add   ebx, eax
    sub   ecx, edx
    cmovnc eax, ebx
    shr   eax, 1
    mov   esi, [esp]            ; restore as per calling convention
    mov   ebx, [esp+4]          ; restore as per calling convention
    add   esp, 8                ; remove temp variables
    ret   4                     ; pop one DWORD arg and return
isqrt endp

align 4

isqrt_p4 proc arg:DWORD
    mov   ecx, [esp+4]          ; arg
    push  ebx                   ; save as per calling convention
    push  esi                   ; save as per calling convention
    xor   eax, eax              ; 2*root
    ; iteration 15
    mov   ebx, 2 SHL 15
    mov   esi, ecx
    sub   ecx, 1 SHL 30
    cmovnc eax, ebx
    cmovc ecx, esi
    iterasm_p4 14
    iterasm_p4 13
    iterasm_p4 12
    iterasm_p4 11
    iterasm_p4 10
    iterasm_p4 9
    iterasm_p4 8
    iterasm_p4 7
    iterasm_p4 6
    iterasm_p4 5
    iterasm_p4 4
    iterasm_p4 3
    iterasm_p4 2
    iterasm_p4 1
    ; iteration 0
    mov   edx, 1
    mov   ebx, 2
    add   edx, eax
    add   ebx, eax
    sub   ecx, edx
    cmovnc eax, ebx
    shr   eax, 1
    mov   esi, [esp]            ; restore as per calling convention
    mov   ebx, [esp+4]          ; restore as per calling convention
    add   esp, 8                ; remove temp variables
    ret   4                     ; pop one DWORD arg and return
isqrt_p4 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

GLOBAL oldcw dw 0
GLOBAL newcw dw 0

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

fisqrt proc arg:DWORD
    fstcw oldcw
    fwait
    mov   ax, oldcw
    or    ax, 0C00h             ; Set RC field for truncating mode
    mov   newcw, ax
    fldcw newcw

    fild  DWORD PTR[esp+4]
    fsqrt
    push  eax
    fistp DWORD PTR[esp]
    pop   eax 

    fldcw oldcw
    ret   4
fisqrt endp

align 4

fisqrt_not proc arg:DWORD
    fild  DWORD PTR[esp+4]
    fsqrt
    push  eax
    fistp DWORD PTR[esp]
    pop   eax
    ret   4
fisqrt_not endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

[attachment deleted by admin]
eschew obfuscation

Phoenix

Results for AMD Athlon64 3000+ (WinXP Pro SP2):

isqrt     : 59 cycles
isqrt_p4  : 87 cycles
fisqrt    : 68 cycles
fisqrt_not: 25 cycles

dioxin

AthlonXP (Barton) 2600+ WinXP home
isqrt     : 75 cycles
isqrt_p4  : 91 cycles
fisqrt    : 68 cycles
fisqrt_not: 27 cycles

dioxin

From the AMD Athlon optimization guide:
Quote
3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.
Optimized 15-Bit Precision Square Root
This square root operation can be executed in only seven cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four cycles less than the square root operation.
Example 1:
MOVD MM0, [MEM]     ; 0 | a
PFRSQRT MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approximate)
PUNPCKLDQ MM0, MM0  ; a | a (MMX instr.)
PFMUL MM0, MM1      ; sqrt(a) | sqrt(a)


Optimized 24-Bit Precision Square Root
This square root operation can be executed in only 19 cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four cycles less than the square root operation.
Example 2:
MOVD MM0, [MEM]     ; 0 | a
PFRSQRT MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approx.)
MOVQ MM2, MM1       ; X_0 = 1/(sqrt a) (approx.)
PFMUL MM1, MM1      ; X_0 * X_0 | X_0 * X_0 (step 1)
PUNPCKLDQ MM0, MM0  ; a | a (MMX instr)
PFRSQIT1 MM1, MM0   ; (intermediate) (step 2)
PFRCPIT2 MM1, MM2   ; 1/sqrt(a) | 1/sqrt(a) (step 3)
PFMUL MM0, MM1      ; sqrt(a) | sqrt(a)



MichaelW

From the page I linked:
Quote
Of course, with the advent of SIMD instruction sets in modern x86 CPUs, some of these approximations have been incorporated into the instruction set itself. The following is an example that computes two parallel inverse square roots using AMD's 3DNow! instruction set which was introduced in 1997:

.586
.MMX
.K3D

MOVQ MM0, [mem] ;         b | a
PFRSQRT         MM1, MM0 ; 1/sqrt(a) | 1/sqrt(a) (12 bit approx)
MOVQ            MM2, MM0 ;
PUNPCKHDQ       MM0, MM0 ;         b | b
PFRSQRT         MM0, MM0        ; 1/sqrt(b) | 1/sqrt(b) (12 bit approx)
PUNPCKLDQ       MM1, MM0        ; 1/sqrt(b) | 1/sqrt(a) (12 bit approx)
MOVQ            MM0, MM1 ;
PFMUL           MM1, MM1 ;      ~1/b | ~1/a      (12 bit approx)
PFRSQIT1        MM1, MM2        ;       ??? | ???
PFRCPIT2        MM1, MM0 ; 1/sqrt(b) | 1/sqrt(a) (23 bit approx)

Quote
Which, of course, will give the highest throughput of any of the code shown here, but is tied to a particular processor architecture.

Does the Athlon64/Opteron support 3DNow! ?

eschew obfuscation

EduardoS

All AMD processors after K6-2 suport 3DNow!, including the Athlon 64,
SSE have reciprocal square root too, and Intel processors have SSE.

About the integer square root, i guess the P4 version will run slower than the general version even in a P4...

anon


Pentium 4 (2GHz) XP Home

isqrt     : 236 cycles
isqrt_p4  : 281 cycles
fisqrt    : 51 cycles
fisqrt_not: 38 cycles

MichaelW

Thanks for testing, guys. The linked page did not actually state that the version I labeled 'P4' would be faster on a P4. I was assuming that the author of the isqrt code had tested it on a P4 and included the 'generic 386+' version for the P4, but I see now that I was misinterpreting this:
Quote
(Note that although other x86 processors can execute cmovCC instructions, they do not yield improvements in performance over the code given except on Athlon based processors.) This solution significantly outperforms the first method. Norbert wrote in to inform me that cmovCC, sbb and setCC instructions are intolerably slow on the P4 platform. However:

... the fastest ISQRT() algorithm by far is to go through the FPU. Even with the change in FP control word it's quite fast. It looks like the P4 has an optimization for FLDCW that's better than what is in the Athlon. Since most applications use just two different CW values I speculate that they might be "caching" the last two CWs seen internally, i.e. this is like renaming limited to 2 rename registers. 

So it looks like for my purposes an FPU version would be best, for any recent processor.



eschew obfuscation

u

Setting/restoring the FPU state kills performance, yet we need the correct truncated results. This is done by simply subtracting 0.5 from ST(0) just before fistp. *

fk_half real4 0.5

usqrt proc arg
fild arg
fsqrt
fsub fk_half
fistp arg
mov eax,arg
ret
usqrt endp


AthlonXP 2000+ : 24 cycles


*provided that we don't tinker with the FPU control word
Please use a smaller graphic in your signature.

Human

well to what we can read on tests between amd64 and intel64, intel only has faster fpu, amd64 has faster integer sse/sse2/sse3
and compare it to code simplicity.
fpu shitty stack setting, mmx+ directly to mmx/xmm regs
fpu 80bit operations, data given only 64bit
xmm 128bit float data given 128bit
so choose what you want, even today games when rotate objects use mmx sse due with that you can rotate in 1 instruction 1-8 points 1x128,2x64,4x32bit,8x16 bit precision.
sorry but fpu is dead, i dont see point in using it, well maybe when someone has 486dx and we want compatiblity. also try to search prime number with 6 mln numbers with 64bit and 128bit, 2 times or even 4 times less operations. why 4? because try to mul 32x32bit
yeah its 1 instruction. but try to mul 64x64 we need at least 4-5 instructions

u

FPU is definitely not dead. At least for me, on AthlonXP it matches or often outperforms SSE on complex DSP tasks. The last benchmark I've made to compare them is on 6-point spline interpolation of mono audio. FPU: 123 cycles/sample SSE:~180 cycles/sample. But I am not as good in SSE optimization as in FPU..
The only place I've seen SSE outperform the FPU is in simple tasks like Audio=AudioIn1+AudioIn2 . This is executed with stereo audio for 1 cycle/sample (with SSE) vs 2 cycles/sample (with FPU).
Please use a smaller graphic in your signature.

GregL

Remember the thing about Win64 not supporting FPU and MMX? It's not true. Agner Fog talks about it here:

Calling Conventions - Agner Fog

Quote6.1 Can floating point registers be used in 64-bit Windows?

There is widespread confusion about whether 64-bit Windows allows the
use of the floating point registers ST(0)-ST(7) and the MM0 - MM7
registers that are aliased upon these. One early technical document
found at Microsoft's website says "x87/MMX registers are unavailable to
Native Windows64 applications" (Rich Brunner: Technical Details Of
Microsoft® Windows® For The AMD64 Platform, Dec. 2003). An AMD
document says: "64-bit Microsoft Windows does not strongly support MMX
and 3Dnow! instruction sets in the 64-bit native mode" (Porting and
Optimizing Multimedia Codecs for AMD64 architecture on Microsoft®
Windows®, July 21, 2004). A document in Microsoft's MSDN says: "A
caller must also handle the following issues when calling a callee:
[...] Legacy Floating-Point Support: The MMX and floating-point stack
registers (MM0-MM7/ST0-ST7) are volatile. That is, these legacy
floating-point stack registers do not have their state preserved across
context switches" (MSDN: Kernel-Mode Driver Architecture: Windows DDK:
Other Calling Convention Process Issues. Preliminary, June 14, 2004;
February 18, 2005).

This description is nonsense because it confuses saving registers
across function calls and saving registers across context switches.
Some versions of the Microsoft assembler ml64 (e.g. v. 8.00.40310)
gives the following message when attempts are made to use floating
point registers in 64 bit mode: "error A2222: x87 and MMX instructions
disallowed; legacy FP state not saved in Win64".

However, a public discussion forum quotes the following answers from
Microsoft engineers regarding this issue: "From: Program Manager in
Visual C++ Group, Sent: Thursday, May 26, 2005 10:38 AM. It does
preserve the state. It's the DDK page that has stale information, which
I've requested it to be changed. Let them know that the OS does
preserve state of x87 and MMX registers on context switches." and
"From: Software Engineer in Windows Kernel Group, Sent: Thursday, May
26, 2005 11:06 AM. For user threads the state of legacy floating point
is preserved at context switch. But it is not true for kernel threads.
Kernel mode drivers can not use legacy floating point instructions."
(www.planetamd64.com/index.php?showtopic=3458&st=100).

The issue has finally been resolved with the long overdue publication
of a more detailed ABI for x64 Windows in the form of a document
entitled "x64 Software Conventions", well hidden in the bin directory
(not the help directory) of some compiler packages. This document says:
"The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are
preserved across context switches. There is no explicit calling
convention for these registers. The use of these registers is strictly
prohibited in kernel mode code." My tests indicate that these registers
are saved correctly during task switches and thread switches in 64-bit
mode, even in an early beta version of x64 Windows. Furthermore, I see
no reason to not save these registers. If the floating point registers
were not saved during a task switch then they would have to be cleared
for security reasons; and the time required for clearing these
registers would be no less than the time required for saving these
registers. The floating point registers must be supported when running
legacy 32-bit programs as well. These considerations do not apply to
kernel mode where information security is less of an issue. According
to the above information, the operating system may not save the
floating point registers across kernel mode context switches.

The Microsoft C++ compiler version 14.0 never uses these registers in
64-bit mode, and doesn't support long double precision. The Intel C++
compiler for x64 Windows supports long double precision and __m64 in
version 9.0 and later, while earlier versions do not.

The conclusion is that it is safe to use floating point registers and
MMX registers in 64-bit Windows, except in kernel mode drivers.

The FPU and MMX are definately not dead!  :U




u

Quotefpu 80bit operations, data given only 64bit
xmm 128bit float data given 128bit
Ah yes, "precision". This is another major plus of the FPU :) . Remember, SSE was made with one thing in mind: 3D games. 32-bit floats are ideal for games, but many other areas of software need higher precision ^^ .
I'm not saying "ignore SSE" or "ignore FPU" - just that the coder should make the same FP-using proc in two versions, and compare their performance extensively. It's a relief that we can use the two sets at once  :toothy
Please use a smaller graphic in your signature.

xanatose

FPU still have its place in 3D.

Normalizing vectors in SIMD is faster, but sometimes you get a vector that is not of length 1.0.  Wich can lead to accumulative errors.

Thus is a tradeoff between speed and accuracy.