I don't see any big difference; it's just simpler.
Vec_SubSSE proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
; fld dword ptr [ebx + VERTEX.x]
; fsub dword ptr [ecx + VERTEX.x]
; fstp dword ptr [eax + VERTEX.x]
;
; fld dword ptr [ebx + VERTEX.y]
; fsub dword ptr [ecx + VERTEX.y]
; fstp dword ptr [eax + VERTEX.y]
;
; fld dword ptr [ebx + VERTEX.z]
; fsub dword ptr [ecx + VERTEX.z]
; fstp dword ptr [eax + VERTEX.z]
movups xmm0,[ebx]
movups xmm1,[ecx]
subps xmm0,xmm1
movups [eax],xmm0
ret
Vec_SubSSE endp
Vec_Sub proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
fld dword ptr [ebx + VERTEX.x]
fsub dword ptr [ecx + VERTEX.x]
fstp dword ptr [eax + VERTEX.x]
fld dword ptr [ebx + VERTEX.y]
fsub dword ptr [ecx + VERTEX.y]
fstp dword ptr [eax + VERTEX.y]
fld dword ptr [ebx + VERTEX.z]
fsub dword ptr [ecx + VERTEX.z]
fstp dword ptr [eax + VERTEX.z]
ret
Vec_Sub endp
if you don't see a difference, then the code may not be executed as often as you think
the SSE code above should be a few times faster than the FPU code
you may need to formulate the proper test to see the difference
it's likely that most of the time is consumed elsewhere, making it hard to see a change
You'll be surprised. mul took 1 ms and fmul 913 ms. If OpenGL uses too much FPU or SSE2 code, I think I can beat them. I'm still gathering info on what they do.
Simply be aware that SSE is less widely supported....
I think SSE2 is supported by most computer users now, but good programming practice means you need to check at the start of your application for the relevant instruction support....
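Something along these lines would do (just a sketch, the proc name is made up; it assumes a CPU new enough to have CPUID leaf 1):
; hypothetical helper - returns eax = 1 if CPUID reports SSE2, else 0
HasSSE2 proc uses ebx
mov eax, 1              ; CPUID leaf 1: feature flags
cpuid
xor eax, eax
bt edx, 26              ; EDX bit 26 = SSE2 (bit 25 = SSE, ECX bit 0 = SSE3)
setc al
ret
HasSSE2 endp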
Recently I checked out OceanJeff's fireworks demo.... It worked on my AMD but not on my far newer Intel....
PS. Use MichaelW's timing script; your timings look a little shaky. Have a look at the Laboratory test pieces to understand how to better time your code....
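For instance, with MichaelW's counter macros included, a test might look roughly like this (only a sketch - the include path, loop count and data labels are placeholders for whatever your project uses):
include \masm32\include\masm32rt.inc
include \masm32\macros\timers.asm           ; MichaelW's timing macros (adjust path)
; ... data and start label as usual ...
counter_begin 1000000, HIGH_PRIORITY_CLASS
invoke Vec_SubSSE, addr DestVec, addr VecA, addr VecB
counter_end                                 ; cycle count is returned in eax
print str$(eax), " cycles for Vec_SubSSE", 13, 10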
Quote from: Farabi on February 27, 2012, 12:46:36 PM
You'll be surprised. mul took 1 ms and fmul 913 ms.
It means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
194 cycles for 100*fmul
179 cycles for 100*mulsd
13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
193 cycles for 100*fmul
180 cycles for 100*mulsd
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5 cycles for Vec_SubSSE
12 cycles for Vec_Sub
153 cycles for 100*fmul
587 cycles for 100*mulsd
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
154 cycles for 100*fmul
602 cycles for 100*mulsd
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...
Quote from: dancho on February 27, 2012, 03:42:06 PM
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...
Quote
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...
yes, of course,
the data must be aligned on a 16-byte boundary...
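For example, one way to get guaranteed 16-byte alignment (just a sketch, labels made up; note the fourth padding float so movaps can read and write a full 16 bytes per vector):
VecData segment para public 'DATA'          ; PARA = 16-byte aligned segment
VecA REAL4 1.0, 2.0, 3.0, 0.0               ; x, y, z + one float of padding
VecB REAL4 0.5, 0.5, 0.5, 0.0
DestVec REAL4 4 dup (0.0)
VecData ends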
With proper alignment and code I get the following results:
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
193 cycles for 100*fmul
630 cycles for 100*mulsd
4 cycles for Vec_SubSSE
8 cycles for Vec_Sub
202 cycles for 100*fmul
619 cycles for 100*mulsd
movaps xmm0,[ebx]
subps xmm0,[ecx]
movaps [eax],xmm0
It is for such tasks that SSEx was introduced - all you need to do is set up the right conditions. Also, implementing this function (6 instructions!) as a macro is highly recommended.
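Something like this, for example (a sketch only - it assumes the three pointers are already in registers and the data is 16-byte aligned):
Vec3Sub MACRO dst:REQ, A:REQ, B:REQ     ;; dst = A - B, all 16-byte aligned pointers
movaps xmm0, [A]
subps xmm0, [B]
movaps [dst], xmm0
ENDM
Vec3Sub eax, ebx, ecx                   ; expands to the three instructions above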
EDIT: @jj, there is a small bug for mulsd: -> movlps --> movsd
then I get:
Quote
3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
167 cycles for 100*fmul
174 cycles for 100*mulsd
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
204 cycles for 100*fmul
196 cycles for 100*mulsd
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
180 cycles for 100*fmul
176 cycles for 100*mulsd
4 cycles for Vec_SubSSE
7 cycles for Vec_Sub
173 cycles for 100*fmul
175 cycles for 100*mulsd
With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
195 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd
9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
196 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd
P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic :green
Quote
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
168 cycles for 100*fmul
590 cycles for 100*mulsd movlps ;(ps<>sd)
216 cycles for 100*mulsd movsd
3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
164 cycles for 100*fmul
617 cycles for 100*mulsd movlps ;(ps<>sd)
172 cycles for 100*mulsd movsd
--- ok ---
... and again, it is nice to see that the FPU is still on an equal footing with SSEx :bg
It seems SSE is only faster on Intel processors.
Quote from: Farabi on February 29, 2012, 07:51:56 AM
It seems SSE is only faster on Intel processors.
Different instructions, different processors.... One SSE MemCopy function I used on my old AMD was over 2x faster than movsd....
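Not the actual routine, but the idea was roughly this kind of loop (a sketch with made-up names; both buffers assumed 16-byte aligned, the length a whole number of 16-byte blocks, at least one):
CopySSE proc uses esi edi pDst:dword, pSrc:dword, nBlocks:dword
mov edi, pDst
mov esi, pSrc
mov ecx, nBlocks        ; number of 16-byte blocks, assumed > 0
@@:
movaps xmm0, [esi]      ; load 16 bytes
movaps [edi], xmm0      ; store 16 bytes
add esi, 16
add edi, 16
dec ecx
jnz @B
ret
CopySSE endp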
Oh, I forgot: what I mean by mul is the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for floating point?
That wasn't my fireworks demo, btw, just clarifying, but I did experiment with it, and learned about MMX and SSE and how they work from that.
Very cool stuff!
it's been a while since i've visited ronybc.com, but his website looks like it's been taken over by BUGS and ADS! Beware upon visiting.
later,
jeff c
:U
Quote from: Farabi on March 03, 2012, 10:01:42 AM
Why not use div and mul as substitutes for floating point?
floating point <> integer
Mul 1 ms, fmul 918 ms.
We could use what they call fixed point as a substitute for floating point. I proposed the mul and div instructions for the precision scaling, but shr-ing a 32-bit value is a lot faster.
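For instance, with 16.16 fixed point (value = real * 65536) a multiply is just an integer mul plus a shift - a rough sketch, the proc name is made up:
FixMul proc a:dword, b:dword    ; a, b and the result are 16.16 fixed point
mov eax, a
imul b                          ; edx:eax = a * b, a 32.32 result
shrd eax, edx, 16               ; keep the middle 32 bits -> 16.16 result
ret
FixMul endp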
Quote from: OceanJeff32 on March 03, 2012, 10:42:51 AM
That wasn't my fireworks demo, btw, just clarifying, but I did experiment with it, and learned about MMX and SSE and how they work from that.
Very cool stuff!
it's been a while since i've visited ronybc.com, but his website looks like it's been taken over by BUGS and ADS! Beware upon visiting.
later,
jeff c
:U
:lol Hi Jeff, I checked the code briefly but didn't find the offending Intel instruction; it was good code though.... Sorry, it wasn't an accusation, just a heads up :lol, I wondered if you would see it
Quote from: Farabi on March 03, 2012, 12:33:32 PM
Mul 1 ms, fmul 918 ms.
timer_begin 10000000, REALTIME_PRIORITY_CLASS
fmul
timer_end
RT... (http://danielsantos.org/2007/07/24/rtfm/), e.g. this one (http://www.website.masmforum.com/tutorials/fptute/index.html)
Jochen,
i think his times are for completion of entire functions, one using mul and one using fmul
much of the results will depend on how they are written, of course :P
Dave,
He puts a simple fmul between timer_begin and timer_end. After (in the most optimistic scenario) the 8th iteration, he gets an exception, and the FPU grinds down to a halt. I had mentioned this already in reply #4 (http://www.masm32.com/board/index.php?topic=18425.msg155517#msg155517), but why read posts if you can boldly state that the FPU is shit, and SSE is the future?
the FPU is pretty fast
Raymond has proved that on more than one occasion :P
the difference here is between floats and integers, i think
Well, not really... it's a bit more complex:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
494 cycles for 100*mul eax
438 cycles for 100*fmul, properly used
17274 cycles for 100*fmul, improperly used
494 cycles for 100*mul eax
438 cycles for 100*fmul, properly used
17252 cycles for 100*fmul, improperly used
P.S.: Google just told me we've treated that already (http://www.masm32.com/board/index.php?topic=18044.msg152408#msg152408), not so long ago :bg
I don't get it, so in my code the FPU simply errors and halts?
read the FPU tutorial by Ray :U
when you "put something" into the FPU, it gets pushed onto the internal stack
when the FPU stack is full, bad things can happen :P
to make space, pop something out
this can generally be done by using an instruction that pops and saves to memory at the same time
but - it also means there has to be an empty register to start with - you get 8 of them
i am surprised that i do not see more use of local variables for storage of reals...
fstp real8 ptr [ebp-16]
very efficient :U
of course, if you make a local for a real10, it should be 12 bytes :P
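here is the kind of thing i mean - just a little sketch (names made up, using the same VERTEX struct as above), with a REAL8 local so every fld ends up matched by an fstp and the stack never fills :P
SumXYZ proc uses ebx pVec:dword
LOCAL sum:REAL8
mov ebx, pVec
fld dword ptr [ebx + VERTEX.x]      ; push x - one item on the FPU stack
fadd dword ptr [ebx + VERTEX.y]     ; ST(0) = x + y - still one item
fadd dword ptr [ebx + VERTEX.z]     ; ST(0) = x + y + z
fstp sum                            ; pop into the local - stack is empty again
fld sum                             ; reload - result returned in ST(0)
ret
SumXYZ endp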
Quote from: dedndave on March 04, 2012, 11:49:30 AM
i am surprised that i do not see more use of local variables for storage of reals...
fstp real8 ptr [ebp-16]
very efficient :U
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Onan,
By default the FPU handles exceptions internally, so the only evidence you see of the exceptions are incorrect results, and if you bother to check the execution time, much slower execution. This code detects the exceptions by checking the FPU status word.
;==============================================================================
include \masm32\include\masm32rt.inc
;==============================================================================
.data
junk real8 ?
.code
;==============================================================================
ShowStatusWord proc
local sw:word
fstsw sw
test sw, 1111111b
jnz @F
printf(".")
@@:
test sw, 0000001b
jz @F
printf("I")
@@:
test sw, 0000010b
jz @F
printf("D")
@@:
test sw, 0000100b
jz @F
printf("Z")
@@:
test sw, 0001000b
jz @F
printf("O")
@@:
test sw, 0010000b
jz @F
printf("U")
@@:
test sw, 0100000b
jz @F
printf("P")
@@:
test sw, 1000000b
jz @F
printf("S")
@@:
ret
ShowStatusWord endp
;==============================================================================
start:
;==============================================================================
;----------------------------------------------------
; The exception flags are identified as follows:
; I = invalid operation
; D = denormalized
; Z = zero divide
; O = overflow
; U = underflow
; P = precision
; S = stack fault
; See Raymond's FPU Tutorial for more information.
;----------------------------------------------------
finit
mov ebx, 20
.while ebx
fmul
call ShowStatusWord
dec ebx
.endw
printf("\n")
finit
mov ebx, 20
.while ebx
fld1
fld1
fmul
call ShowStatusWord
dec ebx
.endw
printf("\n")
finit
mov ebx, 20
.while ebx
fld1
fld1
fmul
fstp junk
call ShowStatusWord
dec ebx
.endw
printf("\n\n")
inkey
exit
;==============================================================================
end start
ISISISISISISISISISISISISISISISISISISISIS
.......ISISISISISISISISISISISISIS
....................
Quote from: qWord on March 04, 2012, 12:42:48 PM
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Even worse, some combine it with the use-threads-as-much-as-possible pest ::)
Quote from: jj2007 on March 04, 2012, 01:12:13 PM
Quote from: qWord on March 04, 2012, 12:42:48 PM
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Even worse, some combine it with the use-threads-as-much-as-possible pest ::)
thread safety isn't the only advantage of locals ::)
i think he's talking to me, qWord - lol
i use a thread whenever i need a synchronous function to be asynchronous :bg
i am also probably guilty of using global vars too often
i usually stick things in global vars to start with, then clean them up and make them local, if it is appropriate
sometimes, i get lazy and don't do the clean-up
Quote from: dedndave on March 04, 2012, 06:53:47 PM
i think he's talking to me, qWord - lol
Nope, Dave, that was a pure inner-German teasing :toothy
There is nothing wrong with locals; I also use them as much as appropriate. Or, to quote Einstein, I try to make everything as simple as possible, but not simpler. There are moments when you need globals. There are moments when you need threads, but that's another story ::)
I'll be home the next day and will try to make a complete procedure to compare them. Currently I'm working in the capital city, Jakarta.
Sorry Sirs, I'm not familiar with all the SSE stuff.
Is there a little SSE tutorial out there?
BTW, here are my results:
VertexTimings1
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
10 cycles for Vec_SubSSE
18 cycles for Vec_Sub
189 cycles for 100*fmul
30965 cycles for 100*mulsd
10 cycles for Vec_SubSSE
15 cycles for Vec_Sub
188 cycles for 100*fmul
31100 cycles for 100*mulsd
VertexTimings2
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
186 cycles for 100*fmul
30784 cycles for 100*mulsd movlps
145 cycles for 100*mulsd movsd
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
186 cycles for 100*fmul
30808 cycles for 100*mulsd movlps
145 cycles for 100*mulsd movsd
Regards
Greenhorn
there are a few out there - google :P
here's one
http://www.tommesani.com/Docs.html
Thanks Dave. :U
...And sorry for being too lazy to google it myself... :red
Regards
Greenhorn
i need to learn MMX and SSE, myself :bg
MulFPU proc a:real4,b:real4
LOCAL d:dword
fld a
fmul b
fistp d
ret
MulFPU endp
Mulx86 proc a:dword,b:dword
xor edx,edx
mov ecx,a
mov eax,b
mul ecx
ret
Mulx86 endp
You're right, there's no big difference. x86 took only 16 ms and the FPU 29 ms; it hardly makes a difference.
no need to zero EDX for MUL :P
also - i find that MUL seems to work better if you can put some unrelated instruction just before
mov ecx,a
mov eax,b
xor edx,edx
mul ecx
what is the sense of comparing a floating point operation with an integer operation?
Quote from: qWord on March 09, 2012, 01:41:23 PM
what is the sense of comparing a floating point operation with an integer operation?
For the graphics optimizer routine, I thought it would be faster if we used integers instead of floating point, but I just found out it's about the same.