Performance / Timing Weirdness ASM vs C#

Started by johnsa, June 15, 2008, 10:48:31 PM

jj2007

Not sure if this is helpful: I have tried to "synchronise" the two listings above in two text files. Open them in an editor and Alt-Tab between them to spot the differences.
Apart from the fld1, fdivr sequence, which aims at replacing the divs with muls, there is this oddity:

; r.z = i.z / rlen;
; *** fld   dword ptr [esi+0Ch]





[attachment deleted by admin]

johnsa

MichaelW, I've taken your test piece and re-inserted my updated FPU version. I've also added millisecond timings for both at the bottom. You'll see they're now almost identical, although I think using fmul st,st(0) to square gives a slight performance increase.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      v1        Vector3D  <1.0,2.0,3.0,1.0>
      vr        Vector3D  <>
      fpusw     dw      0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR

fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(0)
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(0)
faddp st(1),st
fld dword ptr (Vector3D PTR [esi]).z
fmul st,st(0)
faddp st(1),st
fsqrt         
fld1
fdivr
fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).x
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).y
fmul dword ptr (Vector3D PTR [esi]).z
fstp dword ptr (Vector3D PTR [edi]).z
 
ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].Vector3D.z
    fmul [ecx].Vector3D.z
    fld [ecx].Vector3D.y
    fmul [ecx].Vector3D.y
    faddp st(1), st
    fld [ecx].Vector3D.x
    fmul [ecx].Vector3D.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
    fld [ecx].Vector3D.x
    fmul st, st(1)
    fstp [edx].Vector3D.x
    fld [ecx].Vector3D.y
    fmul st, st(1)
    fstp [edx].Vector3D.y
    fmul [ecx].Vector3D.z
    fstp [edx].Vector3D.z

    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10

    timer_begin 10000000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1   ; PROC order is ptrVR, ptrV1
    timer_end
    print ustr$(eax)
    print chr$(" Vector3D Normalize ms",13,10)

    timer_begin 10000000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    timer_end
    print ustr$(eax)
    print chr$("Normalize ms",13,10)

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


So, on my machine both of these come in at around 530ms, while the C#.Net version still manages 150ms for the same number of iterations. So we're still about 3.5 times slower, and there are definitely no stack faults now.
Any more thoughts? Perhaps re-compare this to the C++ test piece?

johnsa

MichaelW, I wrote the same test piece in C++, using GetTickCount (basically the same as yours), and the results are:

94ms (release mode with the pragmas)
78ms (release mode - removed the pragma)
155ms (debug mode with pragmas)

So the C# in release mode seems to be equivalent to C++ in debug mode. C++ version is now about 6 times faster than the asm test piece I just posted.
I'm wondering if this is somehow specific to my machine?
Perhaps re-try the last test piece I posted and the C++ one again.

jj2007

73 cycles, Vector3D_Normalize_FPU
77 cycles, normalize
345 Vector3D Normalize ms
320Normalize ms

P4 2.4 GHz

Numerical output is identical,  I suppose?

johnsa

It's odd that mine comes in a few cycles less and with fewer memory accesses, yet MichaelW's is about 20ms faster on the P4. On my PM the fmul st,st(0) seems to be faster, but on the P4 the memory operand seems to be. Odd.

jj2007

Quote from: johnsa on June 17, 2008, 05:47:46 PM
C++ version is now about 6 times faster than the asm test piece I just posted.
This just doesn't make sense: C++ uses assembler (and machine code, eventually), so it cannot be faster than the same code in asm. Can you isolate those bits that are just a little bit different? And eliminate the differences step by step? I am a newbie in this field, but things that come to my mind are:
- stack fault (see above)
- denormalised numbers (that's why I asked earlier if the results - not: the timings - are identical)

For example, between your two listings I see the compiler insert two fstp/fld sequences; what is their function? Delay FPU execution??

faddp   st(1),st
; c: fstp   qword ptr [ebp-58h]
; c: fld   qword ptr [ebp-58h]
fsqrt
; c: fstp   qword ptr [ebp-50h]
; c: fld   qword ptr [ebp-50h]

Why is there no
fld   dword ptr (Vector3D PTR [esi]).z
in the third last row of your asm listing?

johnsa

We've ruled out stack faults now after checking the FPU status word's stack-fault bit after 1000 iterations of the function.

Results are correct and not denormal, as the input vector isn't modified. The result is recomputed from it each time and stored into a separate result vector.

Those fld/fstp pairs in the C code are, I think, a product of the compiler not being smart enough to optimize across the dependency. It completes a result and stores it, then the next stage of the calculation reloads that value.
Hence it does an fstp to [ebp-58h] and then immediately loads the same value again.

I don't load z in the third-last row because the value I want, 1/length of the vector, is already in st0 ready to be multiplied with z. So just doing the fmul with the memory operand will produce the result in st0, which can then be stored with fstp straight to the result's z.

More than that... I'm utterly confused :)

jj2007

I had run it through Olly and did not see anything suspicious - thanks for explaining in detail what you have done. Really odd. Any chance to isolate the slow instruction? Inserting QPC calls is probably not an option...

johnsa

Re-ran everything via the debugger and double-checked the FPU status after every instruction: no exceptions. The only thing that ever gets set is the P(recision) bit in the status word when the fsqrt happens, which is unavoidable.

I've now come to the conclusion that C#/C++ compiler must be doing something else sneaky somewhere... like setting the FPU to lowest precision if your code never uses a double.. maybe they assume that if the whole code only contains floats, they can get away with setting the round mode to real4 and maybe trunc'ing instead of round.. this could speed up the fpu operations?  ::)

Neo

Quote from: johnsa on June 17, 2008, 10:54:50 PM
like setting the FPU to lowest precision if your code never uses a double.. maybe they assume that if the whole code only contains floats, they can get away with setting the round mode to real4 and maybe trunc'ing instead of round.. this could speed up the fpu operations?  ::)
Yup, that could certainly do it, especially for fdiv and fsqrt.

MichaelW

Quote from: johnsa on June 17, 2008, 10:54:50 PM
I've now come to the conclusion that C#/C++ compiler must be doing something else sneaky somewhere... like setting the FPU to lowest precision if your code never uses a double.. maybe they assume that if the whole code only contains floats, they can get away with setting the round mode to real4 and maybe trunc'ing instead of round.. this could speed up the fpu operations?  ::)

Good idea. The clock cycle counts for fdiv in my previous test seemed to imply that the precision was set to something less than 64 bits. This code tests the effects on a version of your original code with an ffree added to eliminate the stack fault, and my version.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    vector struct
      x REAL4 ?
      y REAL4 ?
      z REAL4 ?
    vector ends

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      v1        vector  <1.0,2.0,3.0>
      vr        vector  <>
      fpusw     dw      0
      fpucw     dw      0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

  mov esi,ptrV1
  mov edi,ptrVR

  fld dword ptr [esi]
  fmul st,st(0)
  fld dword ptr [esi+4]
  fmul st,st(0)
  faddp st(1),st
  fld dword ptr [esi+8]
  fmul st,st(0)
  faddp st(1),st
  fsqrt
  fld dword ptr [esi]
  fdiv st,st(1)
  fstp dword ptr [edi]
  fld dword ptr [esi+4]
  fdiv st,st(1)
  fstp dword ptr [edi+4]
  fld dword ptr [esi+8]
  fdiv st,st(1)
  ffree st(1)
  fstp dword ptr [edi+8]

comment |
  fstsw fpusw
  fwait
  test fpusw, 40h
  jz  @F
  print "SF",13,10
@@:
  print "OK",13,10
|

  ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].vector.z
    fmul [ecx].vector.z
    fld [ecx].vector.y
    fmul [ecx].vector.y
    faddp st(1), st
    fld [ecx].vector.x
    fmul [ecx].vector.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
    fld [ecx].vector.x
    fmul st, st(1)
    fstp [edx].vector.x
    fld [ecx].vector.y
    fmul st, st(1)
    fstp [edx].vector.y
    fmul [ecx].vector.z
    fstp [edx].vector.z
    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    invoke Sleep, 3000

    ; ---------------------------------
    ; Display the current value of the
    ; control word PC field (bits 9-8).
    ; ---------------------------------

    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10,13,10

    ; ------------------------------------------------------
    ; Restore FPU to initialized state to set the PC field
    ; to 11b = 64 bits, then read the value and display it.
    ; ------------------------------------------------------

    finit
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    ; -------------------------------------------------------
    ; Set the PC field in the control word to 10b = 53 bits,
    ; then read the value back and display it.
    ; -------------------------------------------------------

    fstcw fpucw
    and fpucw, 1111111011111111b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    ; -------------------------------------------------------
    ; Set the PC field in the control word to 00b = 24 bits,
    ; then read the value back and display it.
    ; -------------------------------------------------------

    fstcw fpucw
    and fpucw, not 1100000000b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

Control word PC field = 00000002h

Control word PC field = 00000003h
188 cycles, Vector3D_Normalize_FPU
130 cycles, normalize

Control word PC field = 00000002h
159 cycles, Vector3D_Normalize_FPU
131 cycles, normalize

Control word PC field = 00000000h
88 cycles, Vector3D_Normalize_FPU
130 cycles, normalize


I didn't have time to determine why my code is not affected, or why the initial PC setting does not match the FPU initialized state, or to perform any function tests to see what effect the PC setting might have on the return values.
eschew obfuscation
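
The mask arithmetic in the listing above can be checked without touching a real FPU. Below is a small C sketch, assuming only the x87 control word layout already used in the listing (PC field in bits 9-8; 00b = 24 bits, 10b = 53 bits, 11b = 64 bits, 01b reserved). The helper names are hypothetical.

```c
#include <stdint.h>

#define PC_MASK 0x0300u   /* bits 9-8 of the x87 control word */

/* Extract the PC field (0..3), like the movzx/shr/and sequence in the listing. */
static unsigned pc_field(uint16_t cw)
{
    return (cw >> 8) & 3u;
}

/* Return cw with its PC field replaced by pc (0..3). */
static uint16_t set_pc(uint16_t cw, unsigned pc)
{
    return (uint16_t)((cw & ~PC_MASK) | ((pc & 3u) << 8));
}

/* Significand width selected by a PC value; the reserved 01b is reported as 0. */
static int pc_bits(unsigned pc)
{
    switch (pc & 3u) {
    case 0:  return 24;
    case 2:  return 53;
    case 3:  return 64;
    default: return 0;
    }
}
```

Starting from the initialized value 037Fh, ANDing with 1111111011111111b clears only bit 8 and leaves PC = 10b, while ANDing with not 1100000000b clears both bits and leaves PC = 00b. Note that clearing bit 8 alone only yields 10b when the field was 11b beforehand.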

jj2007

On precision bit (Randy Hyde vs Jentje Goslinga):
1. There is a lot of misinformation about the precision bit.
Unless I am terribly wrong the precision bit does not affect
Floating Point Multiplication, Addition or Subtraction, but
only Division and Square Root.
It does not even come into play when multiplying integers.
Neither does it affect any of the other (few) transcendentals.
[One might wonder why the precision bit does not affect the
other transcendentals: probably because they are not computed
using an iterative algorithm]

2. Having settled that issue, the control word in the FPU is
initialized on FINIT to 037FH which masks all FP interrupts
and sets the precision to 64 bits, which is the maximum. You
are probably confusing the 64 bits mantissa which is Extended
Precision with a 64 bit double which is just a double.
Note that there are two bits since there are three settings.

Still, no chance to explain a factor 6 difference with the precision bit set or not set...
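
The claim about 037FH is easy to verify by pulling the value apart. A minimal C sketch, with hypothetical function names, assuming the standard x87 control word layout (exception masks in bits 5-0, precision control in bits 9-8, rounding control in bits 11-10):

```c
#include <stdint.h>

/* Exception mask bits (invalid, denormal, zero-divide, overflow, underflow, precision). */
static unsigned cw_exception_masks(uint16_t cw) { return cw & 0x3Fu; }

/* Precision control field: 00b = 24 bits, 10b = 53 bits, 11b = 64 bits. */
static unsigned cw_precision(uint16_t cw) { return (cw >> 8) & 3u; }

/* Rounding control field: 00b = round to nearest. */
static unsigned cw_rounding(uint16_t cw) { return (cw >> 10) & 3u; }
```

For 037FH this gives all six masks set (3Fh), precision 11b and rounding 00b, which agrees with the quoted description: every exception masked, 64-bit significand, round to nearest.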

johnsa

MichaelW, you're on to something there. One little thing though: TIMERS.ASM calls finit in those end macros, so once the first timing/cycle count loop has ended the PC mode is back to 03h for the loops that follow :)
If you print it out directly after the loop you'll see it has been re-finit'ed.
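
A toy model in C makes the effect easier to see. Nothing here touches a real FPU: fpu_cw is a plain variable standing in for the control word, and timed_run_pc is a hypothetical stand-in for a counter_begin/counter_end pair whose end half reinitializes the FPU, as described above.

```c
#include <stdint.h>

static uint16_t fpu_cw = 0x037F;   /* simulated control word, initialized state */

static void sim_finit(void)    { fpu_cw = 0x037F; }
static void sim_set_pc24(void) { fpu_cw = (uint16_t)(fpu_cw & ~0x0300u); }
static unsigned sim_pc(void)   { return (fpu_cw >> 8) & 3u; }

/* Model of one timed run: report the PC value the timed code ran with,
   then reset the control word the way the end macro is said to. */
static unsigned timed_run_pc(void)
{
    unsigned pc_during = sim_pc();
    sim_finit();
    return pc_during;
}
```

Setting PC once and then calling timed_run_pc twice returns 0 for the first run and 3 for the second, so in this model the precision has to be set again before every timed block.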

Latest Results:

C++ Test-App using straight 3 divs and fsqrt (no recip). 1,000,000 iterations.
156ms debug mode
94ms   release mode with pragma optimizations switched off
78ms   release mode all optimizations

ASM Test Piece (using reciprocal with fmuls), PC set to REAL4 - 1,000,000 iterations.
MichaelW's Normalize 13ms
Vector3D Normalize   13ms

So now the ASM version is 5 times faster than the C++ version (mainly due to REAL4 PC and reciprocal).

What is really strange now is that I have two different asm files, pretty much identical, in the same folder, using the same timers.asm and running the same loop of the same function.
I assemble and link both: one runs in 13ms, the other in 26ms exactly, every time, and for no reason I can see.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686p
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      align 16
      v1        Vector3D  <1.0,2.0,3.0,1.0>
      vr        Vector3D  <>
      fpusw     dw      0
      fpucw     dw      0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR

fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(0)
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(0)
faddp st(1),st
fld dword ptr (Vector3D PTR [esi]).z
fmul st,st(0)
faddp st(1),st
fsqrt         
fld1
fdivr
fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).x
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).y
fmul dword ptr (Vector3D PTR [esi]).z
fstp dword ptr (Vector3D PTR [edi]).z
 
ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].Vector3D.z
    fmul [ecx].Vector3D.z
    fld [ecx].Vector3D.y
    fmul [ecx].Vector3D.y
    faddp st(1), st
    fld [ecx].Vector3D.x
    fmul [ecx].Vector3D.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
    fld [ecx].Vector3D.x
    fmul st, st(1)
    fstp [edx].Vector3D.x
    fld [ecx].Vector3D.y
    fmul st, st(1)
    fstp [edx].Vector3D.y
    fmul [ecx].Vector3D.z
    fstp [edx].Vector3D.z

    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10

    fstcw fpucw
    and fpucw,1111110011111111b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    timer_begin 1000000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1   ; PROC order is ptrVR, ptrV1
    timer_end
    print ustr$(eax)
    print chr$(" Vector3D Normalize ms",13,10)

    fstcw fpucw
    and fpucw,1111110011111111b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    timer_begin 1000000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    timer_end
    print ustr$(eax)
    print chr$(" Normalize ms",13,10)

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Anyhow, that is the updated ASM with PC=REAL4 and the last revision of the actual FPU code.

MichaelW

Quote: TIMERS.ASM calls finit...

You would think I would be able to remember that  :red

It works for the first call because the finit comes after the test loop has ended.

This version corrects the problem and displays the return values for the 64 and 24-bit precisions to 8 digits:

EDIT: updated to your most recent procedure.

EDIT2: and now I realize that they assemble to the same instructions, so it stands to reason that the cycle counts would be the same.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      v1    Vector3D  <1.0,2.0,3.0,1.0>
      vr    Vector3D  <>
      dblx  REAL8     0.0
      dbly  REAL8     0.0
      dblz  REAL8     0.0
      fpusw dw        0
      fpucw dw        0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR

fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(0)
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(0)
faddp st(1),st
fld dword ptr (Vector3D PTR [esi]).z
fmul st,st(0)
faddp st(1),st
fsqrt         
fld1
fdivr
fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).x
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).y
fmul dword ptr (Vector3D PTR [esi]).z
fstp dword ptr (Vector3D PTR [edi]).z
 
ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].Vector3D.z
    fmul [ecx].Vector3D.z
    fld [ecx].Vector3D.y
    fmul [ecx].Vector3D.y
    faddp st(1), st
    fld [ecx].Vector3D.x
    fmul [ecx].Vector3D.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
    fld [ecx].Vector3D.x
    fmul st, st(1)
    fstp [edx].Vector3D.x
    fld [ecx].Vector3D.y
    fmul st, st(1)
    fstp [edx].Vector3D.y
    fmul [ecx].Vector3D.z
    fstp [edx].Vector3D.z
    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    invoke Sleep, 3000

    ; ---------------------------------
    ; Display the current value of the
    ; control word PC field (bits 9-8).
    ; ---------------------------------

    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10,13,10

    ; ------------------------------------------------------
    ; Restore FPU to initialized state to set the PC field
    ; to 11b = 64 bits, then read the value and display it.
    ; ------------------------------------------------------

    finit
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    fld vr.x
    fstp dblx
    fld vr.y
    fstp dbly
    fld vr.z
    fstp dblz
    invoke crt_printf, chr$("%.8f  %.8f  %.8f%c"), dblx, dbly, dblz, 10

    invoke normalize, ADDR v1, ADDR vr
    fld vr.x
    fstp dblx
    fld vr.y
    fstp dbly
    fld vr.z
    fstp dblz
    invoke crt_printf, chr$("%.8f  %.8f  %.8f%c%c"), dblx, dbly, dblz, 10, 10

    ; -------------------------------------------------------
    ; Set the PC field in the control word to 10b = 53 bits,
    ; then read the value back and display it.
    ; -------------------------------------------------------

    fstcw fpucw
    and fpucw, 1111111011111111b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    fstcw fpucw
    and fpucw, 1111111011111111b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    ; -------------------------------------------------------
    ; Set the PC field in the control word to 00b = 24 bits,
    ; then read the value back and display it.
    ; -------------------------------------------------------

    fstcw fpucw
    and fpucw, not 1100000000b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    fstcw fpucw
    and fpucw, not 1100000000b
    fldcw fpucw
    fstcw fpucw
    print "Control word PC field = "
    movzx eax, fpucw
    shr eax, 8
    and eax, 11b
    print uhex$(eax),"h",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10,13,10

    invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    fld vr.x
    fstp dblx
    fld vr.y
    fstp dbly
    fld vr.z
    fstp dblz
    invoke crt_printf, chr$("%.8f  %.8f  %.8f%c"), dblx, dbly, dblz, 10

    invoke normalize, ADDR v1, ADDR vr
    fld vr.x
    fstp dblx
    fld vr.y
    fstp dbly
    fld vr.z
    fstp dblz
    invoke crt_printf, chr$("%.8f  %.8f  %.8f%c%c"), dblx, dbly, dblz, 10, 10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

Control word PC field = 00000002h

Control word PC field = 00000003h
130 cycles, Vector3D_Normalize_FPU
130 cycles, normalize

0.26726124  0.53452247  0.80178374
0.26726124  0.53452247  0.80178374

Control word PC field = 00000002h
113 cycles, Vector3D_Normalize_FPU
Control word PC field = 00000002h
113 cycles, normalize

Control word PC field = 00000000h
70 cycles, Vector3D_Normalize_FPU
Control word PC field = 00000000h
70 cycles, normalize

0.26726124  0.53452247  0.80178374
0.26726124  0.53452247  0.80178374

johnsa

So I think the conclusions are:

1) If you don't ever need more than REAL4 in your code, set the PC. It'll roughly double the FPU speed when using div/sqrt! (I'm curious to know if the C#/C++ compiler was already doing this with optimizations on.)
2) Both fmul st,mem and fmul st,st(0) perform identically on the P3; on my PM the st,st(0) version is a few cycles faster.
3) After all of this, the assembly language version is now:
      20ms faster than the optimized C++ on 1,000,000 iterations with full precision FPU
      2x as fast set to 53-bit mantissa (REAL8)
      3x as fast set to 24-bit mantissa (REAL4)
4) Avoid stack faults at ALL costs; the penalty is the worst I've seen yet.

Obviously, if you took the C++ test piece and updated it to use the reciprocal, used fsqrt instead of calling the std sqrt lib function, and applied the FPU PC change, it would be the same.