Who knows and use SSE ?

jj2007 · April 23, 2009, 02:10:08 AM

Quote from: NightWare on April 23, 2009, 01:57:36 AM
now, concerning the fastest movdqu/movdqa, it makes sense: think of the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2... :wink

But movdqa is actually slower than movaps... ::)

lingo · April 23, 2009, 05:53:52 AM

"But movdqa is actually slower than movaps."

again nonsense...and two tests more:

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 8 - value expected 22646 ;lenSrchPattern ->7260(1C5Ch)
Boyer-Moore Lingo, word-length shifts: 22646 ; clocks: 24483
Boyer-Moore Lingo,dword-length shifts: 22646 ; clocks: 27209
InString - JJ:                         22647 ; clocks: 36692
InString - Lingo:                      22646 ; clocks: 19656


Search Test 9 -Find 'Duplicate inc' in 'windows.inc' ; lenSrchPattern ->13
Boyer-Moore Lingo, word-length shifts: 1127624 ; clocks: 898528
Boyer-Moore Lingo,dword-length shifts: 1127624 ; clocks: 898721
InString - JJ:                         1127625 ; clocks: 680112
InString - Lingo:                      1127624 ; clocks: 561030

 Press ENTER to exit...

Slowwww again and ...shame, shame... :lol

jj2007 · April 23, 2009, 02:36:00 PM

Quote from: lingo on April 23, 2009, 05:53:52 AM
"But movdqa is actually slower than movaps."
again nonsense...

As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. That is what I did, and on a Celeron M, movaps + movups are slightly faster. On a Prescott P4, they seem to be roughly equivalent; both statements refer to the inner loop of the InstringJJ posted above.

What is really odd, though, is the performance of xmm-to-xmm moves (e.g. movdqa xmm1, xmm2) on a P4.
Here are the timings for a Celeron M:

Aligned, mem to xmm:
7       cycles for 4* movaps
7       cycles for 4* movapd
7       cycles for 4* movdqa

Unaligned, mem to xmm:
13      cycles for 4* movups
13      cycles for 4* movupd
13      cycles for 4* movdqu

Aligned, xmm to xmm:
4       cycles for 4* movaps
4       cycles for 4* movapd
4       cycles for 4* movdqa

Aligned, xmm to MEM to xmm:
15      cycles for 4* movaps
15      cycles for 4* movapd
15      cycles for 4* movdqa

And here is the P4:

Aligned, mem to xmm:
4       cycles for 4* movaps
3       cycles for 4* movapd
3       cycles for 4* movdqa

Unaligned, mem to xmm:
25      cycles for 4* movups
26      cycles for 4* movupd
25      cycles for 4* movdqu

Aligned, xmm to xmm:
27      cycles for 4* movaps         <---------------------------------
27      cycles for 4* movapd
27      cycles for 4* movdqa

Aligned, xmm to MEM to xmm:
17      cycles for 4* movaps         <---------------------------------
17      cycles for 4* movapd
17      cycles for 4* movdqa

Surprisingly,
   movdqa [esi], xmm0
   movdqa xmm1, [esi]
is faster than a simple
   movdqa xmm1, xmm0

For the aficionados, I attach the testbed. I could not reproduce that speed gain in a real life algo (the inner loop of the InstringJJ posted above).

[attachment deleted by admin]

lingo · April 23, 2009, 04:39:37 PM

"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "

I did that some years ago, and my results are similar to link..
"Your" contrary information is lifted from the generic optimization of memcpy() here..

Hence, "I could not reproduce that speed gain in a real life algo" is proof that
"But movdqa is actually slower than movaps." is nonsense.
Hence, "nonsense" here is just the truth rather than 'emotional abuse'... :lol

d0d0 · April 23, 2009, 06:34:10 PM

Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg

Please, please team up and give us some tutorials :bg

Imagine all that energy/brain power combined :dazzled:

Rainstorm · April 23, 2009, 06:55:45 PM

Mark Jones wrote:

Quote: So who is going to lavish us with a series of SSE tutorials? :bg

yes.. that would be good.

There are some links to reference material and an SSE tutorial somewhere on the forum, I think.

jj2007 · April 23, 2009, 08:54:59 PM

Quote from: lingo on April 23, 2009, 04:39:37 PM
"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "

I did that some years ago, and my results are similar to link..
"Your" contrary information is lifted from the generic optimization of memcpy() here..

Hence, "I could not reproduce that speed gain in a real life algo" is proof that
"But movdqa is actually slower than movaps." is nonsense.
Hence, "nonsense" here is just the truth rather than 'emotional abuse'... :lol

First, you seem to have a serious problem with the concepts of "stealing" and "quoting".
Second, when you post a link, at least read it carefully before crying nonsense:

Quote:
Memory copy routines tester by Petr Supina, 2005-2006

Block size: 2 x 16 Bytes
Method    Time [ns]
------    ---------
movaps:    6.71704
movdqa:    6.75457

Block size: 2 x 256 Bytes
Method    Time [ns]
------    ---------
movaps:    38.7073
movdqa:    38.7663

That pattern changes for larger block sizes, but your link proves that my observation was correct.
Thank you :U

And just for fun, here are the results for my Celeron M, obtained with the Supina software via your link:

Block size: 2 x 16 Bytes
Method          Time [ns]
------          ---------
movaps:         11.0079
movdqa:         11.1622

Block size: 2 x 256 Bytes
Method          Time [ns]
movaps:         37.8871
movdqa:         39.8011

Block size: 2 x 4096 Bytes
Method          Time [us]
------          ---------
movaps:         0.552621
movdqa:         0.556335

Good night Lingo :8)

NightWare · April 23, 2009, 09:30:52 PM

Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)

no, it CAN be slower, but because of other factors, such as a different alignment of the following instructions (because code alignment counts). Otherwise, the extra byte that has to be read is quickly absorbed by the loop once the algo is in the cache. It can also depend on the number of uops for the port, AND on the normal number of uops handled by your CPU.

jj2007 · April 24, 2009, 04:20:49 AM

Quote from: NightWare on April 23, 2009, 09:30:52 PM
Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)
no, it CAN be slower, but because of other factors, such as a different alignment of the following instructions (because code alignment counts). Otherwise, the extra byte that has to be read is quickly absorbed by the loop once the algo is in the cache. It can also depend on the number of uops for the port, AND on the normal number of uops handled by your CPU.

Agreed, although code alignment plays a surprisingly small role on modern CPUs. I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4.
Another problem is the consistency of timings. I ran another test with this software (see attachment): outliers everywhere, and the six-digit precision it reports is clearly misleading.
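
A minimal sketch of one common way to cope with such outliers, assuming the raw cycle counts of the repeated runs have already been collected in an array: report the minimum and the median instead of a many-digit average. The names summarize_timings and cmp_u64 are made up for this illustration; they are not part of the attached testbed or of the Supina tester, and the sketch is in C rather than MASM.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for unsigned 64-bit cycle counts (hypothetical helper). */
static int cmp_u64(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: given n raw timings, store the smallest sample and
 * the median in *min_out and *median_out. Both are far less sensitive to
 * a few interrupted runs than the arithmetic mean.
 * Returns 0 on success, -1 if n is 0 or memory could not be allocated.
 */
int summarize_timings(const unsigned long long *samples, size_t n,
                      unsigned long long *min_out,
                      unsigned long long *median_out)
{
    unsigned long long *sorted;

    if (n == 0)
        return -1;
    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;

    /* Sort a copy so the caller's array keeps its original run order. */
    memcpy(sorted, samples, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_u64);

    *min_out = sorted[0];
    *median_out = sorted[n / 2];   /* upper median when n is even */

    free(sorted);
    return 0;
}
```

With seven runs of 13, 13, 14, 250, 13, 12 and 13 cycles, the single 250-cycle outlier drags the mean to about 47, while the median stays at 13 and the minimum at 12.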

[attachment deleted by admin]

Draakie · April 24, 2009, 11:26:14 AM

http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

lingo · April 24, 2009, 01:30:24 PM

"although code alignment plays a surprisingly small role on modern CPUs."

nonsense again...
- If I'm not wrong you have no experience with modern CPUs because your CPUs are still archaic

" I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4."

You are the champion of slow code, hence your code is already the slowest before you start your test.
That is the reason it can't get any slower after the test... :lol

I know that you can't control your emotions but you can try your test with my code... :lol

d0d0 · April 24, 2009, 02:00:18 PM

Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

Thanks mate

jj2007 · April 24, 2009, 03:47:03 PM

Quote from: lingo on April 24, 2009, 01:30:24 PM

I know that you can't control your emotions but you can try your test with my code... :lol

My PC hates exceptions :bg

lingo · April 24, 2009, 04:19:33 PM

OK, no offense, but from your code I see that you are still a mad code pilferer and newbie spaghetti-code creator without any ideas or experience in programming.
Due to your age and lack of interest in learning new things from others' experience (A. Fog, etc.), you will stay a mad newbie (with level and interests like this or this or this or this etc.) until the end of your life. That is the reason I lose interest in, and don't want to waste my time on, people like you.. :tdown Sorry, mad leaky watering-pot, and don't forget to take your medicine now! :lol

jj2007 · April 25, 2009, 08:04:03 AM

Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

Thanks, Draakie. The neilkemp and dennishome links are more or less dead, while http://www.tommesani.com/Docs.html is still one of the better sources. Jorgon has a good intro, too.

An excellent complete reference is here.

What I really miss is an in-depth discussion of what exactly are the rules for using/mixing the float and integer instructions. Some sources say that movaps, movapd and movdqa are "functionally equivalent"... so why do we need them all?
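
As a rough illustration of the "functionally equivalent" claim, here is a small C sketch using SSE2 intrinsics (not MASM; vec16 and sse_moves_agree are made-up names). It pushes the same 16 bytes through the float, double and integer aligned load/store intrinsics, which compilers typically, though not necessarily, translate to movaps, movapd and movdqa, and checks that all three give back identical bytes. It only shows that the data moved is the same; it says nothing about speed, about the one-byte-longer encoding of the 66h-prefixed forms, or about any penalty a given CPU may apply when float and integer instructions are mixed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE float ones */
#include <string.h>

/* 16-byte aligned scratch buffer; alignment comes from the vector members. */
typedef union {
    __m128        f;
    __m128d       d;
    __m128i       i;
    unsigned char b[16];
} vec16;

/*
 * Hypothetical check: copy 16 bytes with the single-precision,
 * double-precision and integer aligned load/store intrinsics.
 * Returns 1 if all three copies are bit-identical to the input, else 0.
 * Needs an SSE2 target (default on x86-64; use -msse2 for 32-bit gcc).
 */
int sse_moves_agree(const unsigned char bytes[16])
{
    vec16 in, out_ps, out_pd, out_dq;

    memcpy(in.b, bytes, 16);

    /* Same 128 bits, three different "types" of move. */
    _mm_store_ps((float *)out_ps.b, _mm_load_ps((const float *)in.b));
    _mm_store_pd((double *)out_pd.b, _mm_load_pd((const double *)in.b));
    _mm_store_si128((__m128i *)out_dq.b,
                    _mm_load_si128((const __m128i *)in.b));

    return memcmp(out_ps.b, in.b, 16) == 0
        && memcmp(out_pd.b, in.b, 16) == 0
        && memcmp(out_dq.b, in.b, 16) == 0;
}
```

Calling sse_moves_agree() with any 16-byte pattern should return 1, since plain moves do not interpret the bits they carry; the differences between the three forms lie in encoding and, on some CPUs, in how the result is forwarded to the next instruction.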
