
Performance / Timing Weirdness ASM vs C#

Started by johnsa, June 15, 2008, 10:48:31 PM


johnsa

So, in my quest for optimizations and comparison timings while working on a vector math library, I decided to throw in
a vector-normalize routine using the FPU. Below is the outcome; after a few variations, this one seemed to perform the
best.


align 16
Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR

fld dword ptr [esi]
fmul st,st(0)
fld dword ptr [esi+4]
fmul st,st(0)
faddp st(1),st
fld dword ptr [esi+8]
fmul st,st(0)
faddp st(1),st
fsqrt
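; st(0) now holds sqrt(x*x + y*y + z*z), the vector length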
fld dword ptr [esi]
fdiv st,st(1)
fstp dword ptr [edi]
fld dword ptr [esi+4]
fdiv st,st(1)
fstp dword ptr [edi+4]
fld dword ptr [esi+8]
fdiv st,st(1)
fstp dword ptr [edi+8]

ret
Vector3D_Normalize_FPU ENDP


A MASM test piece, timed with both GetTickCount and the Timers.asm macros supplied on this forum and running 1,000,000
iterations, gives a result of 650 ms.

So I decided to whip up a quick benchmark in Visual Studio 2008 C#, which also runs 1,000,000 iterations, timed
with Environment.TickCount (which corresponds to GetTickCount) and the Stopwatch object (more precise).

Its code is very simple and looks like:



public static vector Normalize(vector v)   // 'in' is a reserved word in C#, so the parameter is named v
{
    vector result = new vector();
    float veclength = (float)Math.Sqrt( (v.x*v.x) + (v.y*v.y) + (v.z*v.z) );
    result.x = v.x / veclength;
    result.y = v.y / veclength;
    result.z = v.z / veclength;
    return(result);
}
.
.
.

for (int i=0;i<1000000;i++)
{
vecB = Normalize(vecA);
}



I checked the disassembly and the FPU/asm code is very similar to my function, with a bit of extra waffle as expected...
Timing result for 1 million iterations: 150 ms.
What the hell??
How can the debug C# code be 5 times faster than the asm function??

I checked that all loops, procs and variables are aligned, and I double-checked to make sure the C# method wasn't
automatically being inlined (it isn't).

What gives?
The only possible explanation I have is that C#'s timers are lying... or else my asm code honestly is 5 times slower...

johnsa

Ok... so to make sure the timers aren't lying, I increased the iterations to 100 million, so that the times were in a range I could verify by counting manually. And true enough, the C# ran in 14.8 seconds and the ASM test piece ran in 67 seconds... what on earth? I also confirmed by looking at the disassembly that C#/.NET uses a straight fsqrt and doesn't have some custom (faster) sqrt implementation.


hutch--

John,

You should be able to use the direct API in C++, and I would be inclined to use an identical timing method on both to make sure you are benchmarking the same thing. A recent comment from another member was that if you don't need the higher level of precision, you may get a substantial speed increase by using SSE2/3/4 instead.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

NightWare

johnsa,

I just recoded this one two days ago, so it still needs testing, but it seems fast:
movaps XMM0,OWORD PTR [esi] ;; XMM0 = W,Z,Y,X
movaps XMM2,XMM0 ;; XMM2 = W,Z,Y,X
mulps XMM0,XMM0 ;; XMM0 = W²,Z²,Y²,X²
pshufd XMM1,XMM0,04Eh ;; XMM1 = Y²,X²,W²,Z² (64-bit halves swapped)
addps XMM1,XMM0 ;; XMM1 = W²+Y²,Z²+X²,Y²+W²,X²+Z²
pshufd XMM0,XMM1,0B1h ;; XMM0 = Z²+X²,W²+Y²,X²+Z²,Y²+W² (adjacent pairs swapped)
addps XMM0,XMM1 ;; every element = X²+Y²+Z²+W²
rsqrtps XMM0,XMM0 ;; every element = ~1/sqrt(X²+Y²+Z²+W²), the reciprocal length
mulps XMM0,XMM2 ;; XMM0 = W,Z,Y,X each multiplied by the reciprocal length
movaps OWORD PTR [edi],XMM0 ;; store the normalized vector

Neo

Using shufps should be faster than pshufd, because the CPU doesn't need to switch SSE register modes.  Also, using the following instead of shufps,addps,shufps,addps might be even faster, even though haddps has higher latency and throughput than most SSE instructions:
haddps xmm0,xmm0
haddps xmm0,xmm0
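
For reference, a minimal C sketch (mine, not code from this thread) of the two horizontal-sum patterns using SSE intrinsics; it assumes SSE3 for _mm_hadd_ps:

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps; includes xmmintrin.h for the rest */

/* Sum the four floats of v into every lane with two haddps. */
static __m128 hsum_haddps(__m128 v)
{
    v = _mm_hadd_ps(v, v);    /* (x+y, z+w, x+y, z+w) */
    return _mm_hadd_ps(v, v); /* every lane = x+y+z+w */
}

/* The same result with the shufps,addps,shufps,addps pattern. */
static __m128 hsum_shufps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,0,3,2)); /* swap 64-bit halves  */
    v = _mm_add_ps(v, t);                                  /* pairwise sums       */
    t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,3,0,1));        /* swap adjacent pairs */
    return _mm_add_ps(v, t);                               /* full sum, all lanes */
}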

MichaelW

John,

I don't have Visual Studio 2008 C#, so I did my comparisons with the Microsoft Visual C++ Toolkit 2003, using this code:

#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <conio.h>

// The pragma optimize statements prevent the compiler from
// optimizing the loop away by placing a copy of the code
// inline and executing it once, eliminating the redundant calls.

typedef struct _vector {
  float x,y,z;
}vector;

#pragma optimize( "t", on )
vector normalize( vector in )
{
  vector result;
  float veclength = (float)sqrt( (in.x*in.x) + (in.y*in.y) + (in.z*in.z) );
  result.x = in.x / veclength;
  result.y = in.y / veclength;
  result.z = in.z / veclength;
  return(result);
}

#pragma optimize( "", off )
int main(void)
{
    vector r,v={1,2,3};
    int i,t1,t2;

    t1=GetTickCount();

    for (i=0 ; i<1000000 ; i++){
      r = normalize(v);
    }

    t2=GetTickCount();

    printf("%d\n",t2-t1);
    printf("%f\t%f\t%f\n",r.x,r.y,r.z);
    getch();
    return 0;
}


And this command line:

cl /O2 /G6 /FA normalize.c


Running on my P3 the time varied somewhat, but a typical result would be:

281
0.267261        0.534522        0.801784


This is the asm code that the compiler generated:

PUBLIC _normalize
PUBLIC __real@3f800000
EXTRN __fltused:NEAR
; COMDAT __real@3f800000
; File c:\program files\microsoft visual c++ toolkit 2003\my\normalize.c
CONST SEGMENT
__real@3f800000 DD 03f800000r ; 1
; Function compile flags: /Ogty
CONST ENDS
; COMDAT _normalize
_TEXT SEGMENT
_result$ = -12 ; size = 12
$T74080 = 8 ; size = 4
_in$ = 12 ; size = 12
_normalize PROC NEAR ; COMDAT
; Line 13
sub esp, 12 ; 0000000cH
; Line 15
fld DWORD PTR _in$[esp+16]
; Line 19
mov eax, DWORD PTR $T74080[esp+8]
fmul DWORD PTR _in$[esp+16]
mov ecx, eax
fld DWORD PTR _in$[esp+12]
fmul DWORD PTR _in$[esp+12]
faddp ST(1), ST(0)
fld DWORD PTR _in$[esp+8]
fmul DWORD PTR _in$[esp+8]
faddp ST(1), ST(0)
fsqrt
fdivr DWORD PTR __real@3f800000
fld DWORD PTR _in$[esp+8]
fmul ST(0), ST(1)
fstp DWORD PTR _result$[esp+12]
mov edx, DWORD PTR _result$[esp+12]
fld DWORD PTR _in$[esp+12]
mov DWORD PTR [ecx], edx
fmul ST(0), ST(1)
fstp DWORD PTR _result$[esp+16]
mov edx, DWORD PTR _result$[esp+16]
mov DWORD PTR [ecx+4], edx
fmul DWORD PTR _in$[esp+16]
fstp DWORD PTR _result$[esp+20]
mov edx, DWORD PTR _result$[esp+20]
mov DWORD PTR [ecx+8], edx
; Line 20
add esp, 12 ; 0000000cH
ret 0
_normalize ENDP


In my asm version I basically duplicated the compiler-generated code, without the integer instructions.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc

    vector struct
      x REAL4 ?
      y REAL4 ?
      z REAL4 ?
    vector ends

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      v1    vector  <1.0,2.0,3.0>
      vr    vector  <>
      dblx  REAL8   0.0
      dbly  REAL8   0.0
      dblz  REAL8   0.0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

  mov esi,ptrV1
  mov edi,ptrVR

  fld dword ptr [esi]
  fmul st,st(0)
  fld dword ptr [esi+4]
  fmul st,st(0)
  faddp st(1),st
  fld dword ptr [esi+8]
  fmul st,st(0)
  faddp st(1),st
  fsqrt
  fld dword ptr [esi]
  fdiv st,st(1)
  fstp dword ptr [edi]
  fld dword ptr [esi+4]
  fdiv st,st(1)
  fstp dword ptr [edi+4]
  fld dword ptr [esi+8]
  fdiv st,st(1)
  fstp dword ptr [edi+8]

  ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].vector.z
    fmul [ecx].vector.z
    fld [ecx].vector.y
    fmul [ecx].vector.y
    faddp st(1), st
    fld [ecx].vector.x
    fmul [ecx].vector.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
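    ; bare fdivr assembles as fdivrp: st(0) = 1.0/length, stack balanced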
    fld [ecx].vector.x
    fmul st, st(1)
    fstp [edx].vector.x
    fld [ecx].vector.y
    fmul st, st(1)
    fstp [edx].vector.y
    fmul [ecx].vector.z
    fstp [edx].vector.z
    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke normalize, ADDR v1, ADDR vr

    fld vr.x
    fstp dblx
    fld vr.y
    fstp dbly
    fld vr.z
    fstp dblz

    invoke crt_printf, chr$("%f  %f  %f %c"), dblx, dbly, dblz, 10

    mov ebx, 1000000
    invoke GetTickCount
    push eax
    .WHILE ebx
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
      dec ebx
    .ENDW
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),13,10

    mov ebx, 1000000
    invoke GetTickCount
    push eax
    .WHILE ebx
      invoke normalize, ADDR v1, ADDR vr
      dec ebx
    .ENDW
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


In my results the time ratio between your code and the compiler's code is 6.88.

0.267261  0.534522  0.801784
1932
241


I didn't have time to determine exactly what is causing the large difference, but I doubt that replacing three fdiv instructions with an fdivr and three fmul instructions could account for it. And I didn't try any other codings to see if the time could be improved. I would like to see how the timing compares for the SSE code.
eschew obfuscation

c0d1f1ed

fdiv has a very high latency and is not pipelined. So you should definitely replace the three divisions by one reciprocal and three multiplications. The rest of the performance difference comes from inlining the call.
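
To make that concrete, a minimal C sketch of the transformation (mine, not code from the thread), reusing the vector struct from MichaelW's test:

#include <math.h>

typedef struct { float x, y, z; } vector;

/* One division forms the reciprocal length; three multiplies replace
   the three divides. */
vector normalize_rcp(vector v)
{
    vector r;
    float inv = 1.0f / (float)sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    r.x = v.x * inv;
    r.y = v.y * inv;
    r.z = v.z * inv;
    return r;
}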

This illustrates just how smart compilers are these days. It's only SIMD code where they haven't really caught up yet. NightWare's SSE code will definitely be very fast. Even faster, though, is normalizing four vectors in parallel, as sketched below.
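
A rough C sketch of that last point (my assumption of an SoA layout, not code from this thread): with the x, y and z components of four vectors held in separate registers, one rsqrtps normalizes all four at once.

#include <xmmintrin.h>

/* Normalize four vectors in parallel; x, y and z each hold one
   component of four different vectors (structure-of-arrays layout). */
static void normalize4(__m128 *x, __m128 *y, __m128 *z)
{
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(*x, *x),
                                        _mm_mul_ps(*y, *y)),
                             _mm_mul_ps(*z, *z));
    __m128 inv  = _mm_rsqrt_ps(len2);  /* ~12-bit 1/sqrt approximation */
    *x = _mm_mul_ps(*x, inv);
    *y = _mm_mul_ps(*y, inv);
    *z = _mm_mul_ps(*z, inv);
}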

c0d1f1ed

Quote from: Neo on June 16, 2008, 03:42:55 AM
Using shufps should be faster than pshufd, because the CPU doesn't need to switch SSE register modes.

There is no such thing as switching SSE register modes. Using pshufd instead of only shufps resulted in a few percent higher performance for my (floating-point) vector code. LLVM uses pshufd aggressively as well.

QuoteAlso, using the following instead of shufps,addps,shufps,addps might be even faster, even though haddps has higher latency and throughput than most SSE instructions:
haddps xmm0,xmm0
haddps xmm0,xmm0

haddps is microcoded, resulting in decoder stalls. It only appears to be beneficial when you need all four sums, which isn't the case for a dot product.
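
A dot product needs the sum in only one lane; a hedged C sketch of that point (mine, not from the thread):

#include <xmmintrin.h>

/* Dot product of two 4-float vectors; only lane 0 of the final add matters. */
static float dot4(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);
    __m128 t = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1,0,3,2)); /* swap 64-bit halves */
    p = _mm_add_ps(p, t);                                  /* pairwise sums      */
    t = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2,3,0,1));        /* bring partner down */
    p = _mm_add_ss(p, t);                                  /* scalar add, lane 0 */
    return _mm_cvtss_f32(p);
}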

MichaelW

Quote from: c0d1f1ed on June 16, 2008, 11:52:53 AM
fdiv has a very high latency and is not pipelined. So you should definitely replace the three divisions by one reciprocal and three multiplications. The rest of the performance difference comes from inlining the call.

What do you mean by "inlining the call"? And how does the high latency explain the difference when the fdiv instructions are independent? Using the exe in the attachment, on my P3 I get cycle counts of 934 and 130. Per Agner Fog, for a P3 the latencies for fmul and fdiv, assuming the worst-case 64-bit precision, should be 5/6 and 38, and the reciprocal throughputs 2 and 37. Even if I assume that the fdiv instructions are dependent and add the latency and reciprocal throughput together (38 + 37 = 75 cycles each, so roughly 225 cycles for all three), and ignore the effect of the increased number of fmul instructions, I can't see how the two additional fdiv instructions could explain a difference of 804 cycles.

QuoteThis illustrates just how smart compilers are these days.

Or not. By using the same basic algorithm I was able to code a faster, and probably smaller, procedure in one try. And I suspect that it can be improved significantly.


[attachment deleted by admin]
eschew obfuscation

NightWare

Quote from: MichaelW on June 16, 2008, 04:01:58 AM
I would like to see how the timing compares for the SSE code.
hi,
Here are the results on my Core 2 (I multiplied the number of loops by 10, otherwise the results are not significant given the ~7 ms imprecision of GetTickCount, and I aligned the vectors so they are usable by SSE+):
7754
436
63
I forgot to add your last code to the test...

[attachment deleted by admin]

Neo

Cool.  I tried a few different variations and got some semi-surprising results:
11544 - The one with 3 fdivs
562 - The one with 1 fdivr
109 - Using haddps,haddps
94 - Using shufps,addps,shufps,addps
78 - Using pshufd,addps,pshufd,addps
47 - Inlined using pshufd,addps,pshufd,addps

The system is a Core 2 Duo 1.66GHz with 2GB of RAM running Windows Vista 64-bit.  The program was 32-bit, which might skew results quite a bit.

johnsa

Hi,

I tried re-organizing the fdivs and tried replacing them with a reciprocal multiply, and the end result is much the same. The hand-coded asm version, which is almost identical to what I saw come out of the C# compiler, is still approx. 5 times slower.

I'm aware that moving routines like this to SSE is the ideal, and I already have that for both AOS and SOA models; however, I wanted a simple high-precision FPU-based version of the functions as well.

I'm really stumped as to why this version is SOOO much slower than the compiler output.

MichaelW

The problem is not fdiv in general, but the way it is being (incorrectly) used. I modified my cycle-count code to test various forms of fdiv, and determined that the problem is an FPU stack fault. If you uncomment the code that tests the stack fault bit in the FPU status word, then for the slow fdiv instruction sequences the code will display 7 OKs, and then display alternating SF and OK for the duration of the test.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    vector struct
      x REAL4 ?
      y REAL4 ?
      z REAL4 ?
    vector ends

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      v1        vector  <1.0,2.0,3.0>
      vr        vector  <>
      fpusw     dw      0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

  mov esi,ptrV1
  mov edi,ptrVR

  fld dword ptr [esi]
  fmul st,st(0)
  fld dword ptr [esi+4]
  fmul st,st(0)
  faddp st(1),st
  fld dword ptr [esi+8]
  fmul st,st(0)
  faddp st(1),st
  fsqrt
  fld dword ptr [esi]
  fdiv st,st(1)
  fstp dword ptr [edi]
  fld dword ptr [esi+4]
  fdiv st,st(1)
  fstp dword ptr [edi+4]
  fld dword ptr [esi+8]
  fdiv st,st(1)
  fstp dword ptr [edi+8]
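
  ; note: the length computed by fsqrt is still in st(0) here and is never
  ; popped, so each call leaves one value on the 8-level FPU stack; after
  ; eight calls the stack is full and the fld instructions above raise
  ; stack faults, which is what makes this version so slow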

  ret

Vector3D_Normalize_FPU ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

normalize proc pv1:DWORD, pvr:DWORD

    mov ecx, pv1
    mov edx, pvr
    fld [ecx].vector.z
    fmul [ecx].vector.z
    fld [ecx].vector.y
    fmul [ecx].vector.y
    faddp st(1), st
    fld [ecx].vector.x
    fmul [ecx].vector.x
    faddp st(1), st
    fsqrt
    fld4 1.0
    fdivr
    fld [ecx].vector.x
    fmul st, st(1)
    fstp [edx].vector.x
    fld [ecx].vector.y
    fmul st, st(1)
    fstp [edx].vector.y
    fmul [ecx].vector.z
    fstp [edx].vector.z

    fld DWORD PTR [ecx]
    fld DWORD PTR [ecx]
    ;fdiv                    ; +12 cycles
    ;fdivr                   ; +12 cycles
    ;fdiv st, st(1)          ; +914 cycles stack fault
    fdiv st(1), st          ; +914 cycles stack fault
    ;fdivp st(1), st         ; +12 cycles
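    ; the two flds above push two values and the fstp below pops only one;
    ; the no-pop forms fdiv st,st(1) and fdiv st(1),st therefore leak one
    ; FPU stack slot per call, while the popping forms keep the stack balanced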

    ; ------------------------------------------------
    ; Uncomment this to see state of stack fault bit.
    ; ------------------------------------------------
    comment ~
    fstsw fpusw
    fwait
    test fpusw, 40h
    jz  @F
    print "SF",13,10
  @@:
    print "OK",13,10
    ~

    fstp DWORD PTR [ecx]

    ret

normalize endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke Vector3D_Normalize_FPU, ADDR vr, ADDR v1
    counter_end
    print ustr$(eax)," cycles, Vector3D_Normalize_FPU",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke normalize, ADDR v1, ADDR vr
    counter_end
    print ustr$(eax)," cycles, normalize",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


And in case it's not obvious from the comments, when fdiv is used correctly, the following sequence executes in 12 cycles on my P3:

    fld DWORD PTR [ecx]
    fld DWORD PTR [ecx]
    fdiv
    fstp DWORD PTR [ecx]


And to cover all bases, if in the timed test of Vector3D_Normalize_FPU I follow the last:

fdiv st,st(1)

With:

ffree st(1)

Then there is no stack fault and the results on my P3 look much more reasonable:


0.267261  0.534522  0.801784
0.267261  0.534522  0.801784
320
231




eschew obfuscation

c0d1f1ed

Nice observation MichaelW! That stack fault penalty is shockingly high. Good thing that can't happen with SSE, and compilers avoid it.

johnsa

Awesome spot!
I can't believe an FPU stack fault is that significant!

Important tip: if you pushed it, you should pop it :)
My results are much better now:

1,000,000 iterations:
FPU = 536ms
SSE = 106ms
C#  = 156ms

Updated FPU Version:


align 16
Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR

fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(0)
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(0)
faddp st(1),st
fld dword ptr (Vector3D PTR [esi]).z
fmul st,st(0)
faddp st(1),st
fsqrt                       
fld1
fdivr
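; bare fdivr assembles as fdivrp, so st(0) = 1/length and nothing is left behind on the FPU stack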
fld dword ptr (Vector3D PTR [esi]).x
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).x
fld dword ptr (Vector3D PTR [esi]).y
fmul st,st(1)
fstp dword ptr (Vector3D PTR [edi]).y
fmul dword ptr (Vector3D PTR [esi]).z
fstp dword ptr (Vector3D PTR [edi]).z

ret
Vector3D_Normalize_FPU ENDP


No stack faults reported over 1000 test runs.

Still about 400 ms (over 1,000,000 iterations) slower than the C# compiled version...

For reference, here is the C# disassembly output:


            float rlen = (float)Math.Sqrt((i.x * i.x) + (i.y * i.y) + (i.z * i.z));
0000004b  fld         dword ptr [esi+4]
0000004e  fmul        st,st(0)
00000050  fld         dword ptr [esi+8]
00000053  fmul        st,st(0)
00000055  faddp       st(1),st
00000057  fld         dword ptr [esi+0Ch]
0000005a  fmul        st,st(0)
0000005c  faddp       st(1),st
0000005e  fstp        qword ptr [ebp-58h]
00000061  fld         qword ptr [ebp-58h]
00000064  fsqrt           
00000066  fstp        qword ptr [ebp-50h]
00000069  fld         qword ptr [ebp-50h]
0000006c  fstp        dword ptr [ebp-44h]
            r.x = i.x / rlen;
0000006f  fld         dword ptr [esi+4]
00000072  fdiv        dword ptr [ebp-44h]
00000075  fstp        dword ptr [ebx+4]
            r.y = i.y / rlen;
00000078  fld         dword ptr [esi+8]
0000007b  fdiv        dword ptr [ebp-44h]
0000007e  fstp        dword ptr [ebx+8]
            r.z = i.z / rlen;
00000081  fld         dword ptr [esi+0Ch]
00000084  fdiv        dword ptr [ebp-44h]
00000087  fstp        dword ptr [ebx+0Ch]
            return (r);