I don't see any big difference; it's just simpler.
Vec_SubSSE proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
; fld dword ptr [ebx + VERTEX.x]
; fsub dword ptr [ecx + VERTEX.x]
; fstp dword ptr [eax + VERTEX.x]
;
; fld dword ptr [ebx + VERTEX.y]
; fsub dword ptr [ecx + VERTEX.y]
; fstp dword ptr [eax + VERTEX.y]
;
; fld dword ptr [ebx + VERTEX.z]
; fsub dword ptr [ecx + VERTEX.z]
; fstp dword ptr [eax + VERTEX.z]
movups xmm0,[ebx]
movups xmm1,[ecx]
subps xmm0,xmm1
movups [eax],xmm0
ret
Vec_SubSSE endp
Vec_Sub proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
fld dword ptr [ebx + VERTEX.x]
fsub dword ptr [ecx + VERTEX.x]
fstp dword ptr [eax + VERTEX.x]
fld dword ptr [ebx + VERTEX.y]
fsub dword ptr [ecx + VERTEX.y]
fstp dword ptr [eax + VERTEX.y]
fld dword ptr [ebx + VERTEX.z]
fsub dword ptr [ecx + VERTEX.z]
fstp dword ptr [eax + VERTEX.z]
ret
Vec_Sub endp
if you don't see a difference, then the code may not be executed as often as you think
the SSE code above should be a few times faster than the FPU code
you may need to formulate the proper test to see the difference
it's likely that most of the time is consumed elsewhere, making it hard to see a change
You'll be surprised. mul took 1 ms and fmul 913 ms. If OpenGL uses too much FPU or SSE2 code, I think I can beat them. I'm still gathering info on what they do.
Simply be aware that SSE is less widely supported....
I think SSE2 is supported by most computer users now, but good programming practice means you need to check at the start of your application for the relevant instruction support....
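Something along these lines would do (just a sketch, the proc name is made up; it assumes a CPU new enough to have CPUID leaf 1):
; hypothetical helper - returns eax = 1 if CPUID reports SSE2, else 0
HasSSE2 proc uses ebx
mov eax, 1              ; CPUID leaf 1: feature flags
cpuid
xor eax, eax
bt edx, 26              ; EDX bit 26 = SSE2 (bit 25 = SSE, ECX bit 0 = SSE3)
setc al
ret
HasSSE2 endp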
Recently I checked out OceanJeff's fireworks demo.... It worked on my AMD but not on my far newer Intel....
PS. Use MichaelW's timing script; your timings look a little shaky. Have a look at the Laboratory test pieces to understand how to better time your code....
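For instance, with MichaelW's counter macros included, a test might look roughly like this (only a sketch - the include path, loop count and data labels are placeholders for whatever your project uses):
include \masm32\include\masm32rt.inc
include \masm32\macros\timers.asm           ; MichaelW's timing macros (adjust path)
; ... data and start label as usual ...
counter_begin 1000000, HIGH_PRIORITY_CLASS
invoke Vec_SubSSE, addr DestVec, addr VecA, addr VecB
counter_end                                 ; cycle count is returned in eax
print str$(eax), " cycles for Vec_SubSSE", 13, 10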
Quote from: Farabi on February 27, 2012, 12:46:36 PM
You'll be surprised. mul took 1 ms and fmul 913 ms.
It means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
194 cycles for 100*fmul
179 cycles for 100*mulsd
13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
193 cycles for 100*fmul
180 cycles for 100*mulsd
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5 cycles for Vec_SubSSE
12 cycles for Vec_Sub
153 cycles for 100*fmul
587 cycles for 100*mulsd
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
154 cycles for 100*fmul
602 cycles for 100*mulsd
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...
Quote from: dancho on February 27, 2012, 03:42:06 PM
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...
Quote
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...
yes, of course,
the data must be aligned on a 16-byte boundary...
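For example, one way to get guaranteed 16-byte alignment (just a sketch, labels made up; note the fourth padding float so movaps can read and write a full 16 bytes per vector):
VecData segment para public 'DATA'          ; PARA = 16-byte aligned segment
VecA REAL4 1.0, 2.0, 3.0, 0.0               ; x, y, z + one float of padding
VecB REAL4 0.5, 0.5, 0.5, 0.0
DestVec REAL4 4 dup (0.0)
VecData ends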
With proper alignment and code I get the following results:
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
193 cycles for 100*fmul
630 cycles for 100*mulsd
4 cycles for Vec_SubSSE
8 cycles for Vec_Sub
202 cycles for 100*fmul
619 cycles for 100*mulsd
movaps xmm0,[ebx]
subps xmm0,[ecx]
movaps [eax],xmm0
It is for such tasks that SSEx was introduced - all you need to do is set up the right conditions. Also, implementing this function (6 instructions!) as a macro is highly recommended.
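Something like this, for example (a sketch only - it assumes the three pointers are already in registers and the data is 16-byte aligned):
Vec3Sub MACRO dst:REQ, A:REQ, B:REQ     ;; dst = A - B, all 16-byte aligned pointers
movaps xmm0, [A]
subps xmm0, [B]
movaps [dst], xmm0
ENDM
Vec3Sub eax, ebx, ecx                   ; expands to the three instructions above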
EDIT: @jj, there is a small bug for mulsd: -> movlps --> movsd
then I get:
Quote
3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
167 cycles for 100*fmul
174 cycles for 100*mulsd
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
204 cycles for 100*fmul
196 cycles for 100*mulsd
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
180 cycles for 100*fmul
176 cycles for 100*mulsd
4 cycles for Vec_SubSSE
7 cycles for Vec_Sub
173 cycles for 100*fmul
175 cycles for 100*mulsd
With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
195 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd
9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
196 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd
P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic :green
Quote
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
168 cycles for 100*fmul
590 cycles for 100*mulsd movlps ;(ps<>sd)
216 cycles for 100*mulsd movsd
3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
164 cycles for 100*fmul
617 cycles for 100*mulsd movlps ;(ps<>sd)
172 cycles for 100*mulsd movsd
--- ok ---
... and again, it is nice to see that the FPU is still on an equal footing with SSEx :bg
It seems SSE is only faster on Intel processors.
Quote from: Farabi on February 29, 2012, 07:51:56 AM
It seems SSE is only faster on Intel processors.
Different instructions, different processors.... One SSE MemCopy function I used on my old AMD was over 2x faster than movsd....
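Not the actual routine, but the idea was roughly this kind of loop (a sketch with made-up names; both buffers assumed 16-byte aligned, the length a whole number of 16-byte blocks, at least one):
CopySSE proc uses esi edi pDst:dword, pSrc:dword, nBlocks:dword
mov edi, pDst
mov esi, pSrc
mov ecx, nBlocks        ; number of 16-byte blocks, assumed > 0
@@:
movaps xmm0, [esi]      ; load 16 bytes
movaps [edi], xmm0      ; store 16 bytes
add esi, 16
add edi, 16
dec ecx
jnz @B
ret
CopySSE endp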
Oh, I forgot: what I mean by mul is the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for floating point?
That wasn't my fireworks demo, btw, just clarifying, but I did experiment with it, and learned about MMX and SSE and how they work from that.
Very cool stuff!
it's been a while since i've visited ronybc.com, but his website looks like it's been taken over by BUGS and ADS! Beware upon visiting.
later,
jeff c
:U
Quote from: Farabi on March 03, 2012, 10:01:42 AM
Why not use div and mul as substitutes for floating point?
floating point <> integer
Mul 1 ms, fmul 918 ms.
We could use what they call fixed point as a substitute for floating point. I proposed the mul and div instructions for the precision scaling, but shr-ing a 32-bit value is a lot faster.
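For instance, with 16.16 fixed point (value = real * 65536) a multiply is just an integer mul plus a shift - a rough sketch, the proc name is made up:
FixMul proc a:dword, b:dword    ; a, b and the result are 16.16 fixed point
mov eax, a
imul b                          ; edx:eax = a * b, a 32.32 result
shrd eax, edx, 16               ; keep the middle 32 bits -> 16.16 result
ret
FixMul endp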
Quote from: OceanJeff32 on March 03, 2012, 10:42:51 AM
That wasn't my fireworks demo, btw, just clarifying, but I did experiment with it, and learned about MMX and SSE and how they work from that.
Very cool stuff!
it's been a while since i've visited ronybc.com, but his website looks like it's been taken over by BUGS and ADS! Beware upon visiting.
later,
jeff c
:U
:lol Hi Jeff, I checked the code briefly but didn't find the offending Intel instruction; it was good code though.... Sorry, it wasn't an accusation, just a heads up :lol, I wondered if you would see it
Quote from: Farabi on March 03, 2012, 12:33:32 PM
Mul 1 ms, fmul 918 ms.
timer_begin 10000000, REALTIME_PRIORITY_CLASS
fmul
timer_end
RT... (http://danielsantos.org/2007/07/24/rtfm/), e.g. this one (http://www.website.masmforum.com/tutorials/fptute/index.html)
Jochen,
i think his times are for completion of entire functions, one using mul and one using fmul
much of the results will depend on how they are written, of course :P
Dave,
He puts a simple fmul between timer_begin and timer_end. After (in the most optimistic scenario) the 8th iteration, he gets an exception, and the FPU grinds down to a halt. I had mentioned this already in reply #4 (http://www.masm32.com/board/index.php?topic=18425.msg155517#msg155517), but why read posts if you can boldly state that the FPU is shit, and SSE is the future?
the FPU is pretty fast
Raymond has proved that on more than one occasion :P
the difference here is between floats and integers, i think
Well, not really... it's a bit more complex:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
494 cycles for 100*mul eax
438 cycles for 100*fmul, properly used
17274 cycles for 100*fmul, improperly used
494 cycles for 100*mul eax
438 cycles for 100*fmul, properly used
17252 cycles for 100*fmul, improperly used
P.S.: Google just told me we've treated that already (http://www.masm32.com/board/index.php?topic=18044.msg152408#msg152408), not so long ago :bg
I don't get it, so in my code the FPU simply errors and halts?
read the FPU tutorial by Ray :U
when you "put something" into the FPU, it gets pushed onto the internal stack
when the FPU stack is full, bad things can happen :P
to make space, pop something out
this can generally be done by using an instruction that pops and saves to memory at the same time
but - it also means there has to be an empty register to start with - you get 8 of them
i am surprised that i do not see more use of local variables for storage of reals...
fstp real8 ptr [ebp-16]
very efficient :U
of course, if you make a local for a real10, it should be 12 bytes :P
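here is the kind of thing i mean - just a little sketch (names made up, using the same VERTEX struct as above), with a REAL8 local so every fld ends up matched by an fstp and the stack never fills :P
SumXYZ proc uses ebx pVec:dword
LOCAL sum:REAL8
mov ebx, pVec
fld dword ptr [ebx + VERTEX.x]      ; push x - one item on the FPU stack
fadd dword ptr [ebx + VERTEX.y]     ; ST(0) = x + y - still one item
fadd dword ptr [ebx + VERTEX.z]     ; ST(0) = x + y + z
fstp sum                            ; pop into the local - stack is empty again
fld sum                             ; reload - result returned in ST(0)
ret
SumXYZ endp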
Quote from: dedndave on March 04, 2012, 11:49:30 AM
i am surprised that i do not see more use of local variables for storage of reals...
fstp real8 ptr [ebp-16]
very efficient :U
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Onan,
By default the FPU handles exceptions internally, so the only evidence you see of the exceptions are incorrect results, and if you bother to check the execution time, much slower execution. This code detects the exceptions by checking the FPU status word.
;==============================================================================
include \masm32\include\masm32rt.inc
;==============================================================================
.data
junk real8 ?
.code
;==============================================================================
ShowStatusWord proc
local sw:word
fstsw sw
test sw, 1111111b
jnz @F
printf(".")
@@:
test sw, 0000001b
jz @F
printf("I")
@@:
test sw, 0000010b
jz @F
printf("D")
@@:
test sw, 0000100b
jz @F
printf("Z")
@@:
test sw, 0001000b
jz @F
printf("O")
@@:
test sw, 0010000b
jz @F
printf("U")
@@:
test sw, 0100000b
jz @F
printf("P")
@@:
test sw, 1000000b
jz @F
printf("S")
@@:
ret
ShowStatusWord endp
;==============================================================================
start:
;==============================================================================
;----------------------------------------------------
; The exception flags are identified as follows:
; I = invalid operation
; D = denormalized
; Z = zero divide
; O = overflow
; U = underflow
; P = precision
; S = stack fault
; See Raymond's FPU Tutorial for more information.
;----------------------------------------------------
finit
mov ebx, 20
.while ebx
fmul
call ShowStatusWord
dec ebx
.endw
printf("\n")
finit
mov ebx, 20
.while ebx
fld1
fld1
fmul
call ShowStatusWord
dec ebx
.endw
printf("\n")
finit
mov ebx, 20
.while ebx
fld1
fld1
fmul
fstp junk
call ShowStatusWord
dec ebx
.endw
printf("\n\n")
inkey
exit
;==============================================================================
end start
ISISISISISISISISISISISISISISISISISISISIS
.......ISISISISISISISISISISISISIS
....................
Quote from: qWord on March 04, 2012, 12:42:48 PM
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Even worse, some combine it with the use-threads-as-much-as-possible pest ::)
Quote from: jj2007 on March 04, 2012, 01:12:13 PM
Quote from: qWord on March 04, 2012, 12:42:48 PM
I do not know why, but I got the impression that assembler programmers in particular seem to be infested by the use-globals-as-much-as-possible pest :bg
Even worse, some combine it with the use-threads-as-much-as-possible pest ::)
thread safety isn't the only advantage of locals ::)
i think he's talking to me, qWord - lol
i use a thread whenever i need a synchronous function to be asynchronous :bg
i am also probably guilty of using global vars too often
i usually stick things in global vars to start with, then clean them up and make them local, if it is appropriate
sometimes, i get lazy and don't do the clean-up
Quote from: dedndave on March 04, 2012, 06:53:47 PM
i think he's talking to me, qWord - lol
Nope, Dave, that was a pure inner-German teasing :toothy
There is nothing wrong with locals; I also use them as much as appropriate. Or, to quote Einstein, I try to make everything as simple as possible, but not simpler. There are moments when you need globals. There are moments when you need threads, but that's another story ::)
I'll be home the next day and will try to make a complete procedure to compare them. Currently I'm working in the capital city, Jakarta.
Sorry Sirs, I'm not familiar with all the SSE stuff.
Is there a little SSE tutorial out there?
BTW, here are my results:
VertexTimings1
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
10 cycles for Vec_SubSSE
18 cycles for Vec_Sub
189 cycles for 100*fmul
30965 cycles for 100*mulsd
10 cycles for Vec_SubSSE
15 cycles for Vec_Sub
188 cycles for 100*fmul
31100 cycles for 100*mulsd
VertexTimings2
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
186 cycles for 100*fmul
30784 cycles for 100*mulsd movlps
145 cycles for 100*mulsd movsd
9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
186 cycles for 100*fmul
30808 cycles for 100*mulsd movlps
145 cycles for 100*mulsd movsd
Regards
Greenhorn
there are a few out there - google :P
here's one
http://www.tommesani.com/Docs.html
Thanks Dave. :U
...And sorry for being too lazy to google it myself... :red
Regards
Greenhorn
i need to learn MMX and SSE, myself :bg
MulFPU proc a:real4,b:real4
LOCAL d:dword
fld a
fmul b
fistp d
ret
MulFPU endp
Mulx86 proc a:dword,b:dword
xor edx,edx
mov ecx,a
mov eax,b
mul ecx
ret
Mulx86 endp
You're right, there's no big difference. x86 took only 16 ms and the FPU 29 ms; it hardly makes a difference.
no need to zero EDX for MUL :P
also - i find that MUL seems to work better if you can put some unrelated instruction just before
mov ecx,a
mov eax,b
xor edx,edx
mul ecx
what is the sense of comparing a floating point operation with an integer operation?
Quote from: qWord on March 09, 2012, 01:41:23 PM
what is the sense of comparing a floating point operation with an integer operation?
For the graphics optimizer routine, I thought it would be faster if we used integers instead of floating point, but I just found out it's about the same.