Title: Who knows and use SSE ?
Post by: mitchi on April 19, 2009, 06:00:55 PM

It seems to me that very few people here know and use SSE in their programs. I could be wrong of course.
As for me, I don't know how to use the FPU, SSE (any version) or MMX. Am I missing something? Should I learn how to use them?
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.

Title: Re: Who knows and use SSE ?
Post by: TASMUser on April 19, 2009, 08:20:48 PM

So do I.

SSE/MMX only makes sense if you have to calculate or handle more than one 8/16/32-bit value at the same time.
In general I get more efficient results with standard ASM instructions, even when I consider using MMX/SSE instructions.
Once you decide on MMX/SSE, you have to gear your whole program/routine towards this instruction set, and you get more interface overhead.
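
To illustrate the "more than one value at the same time" point, here is a minimal sketch in C with SSE2 intrinsics rather than MASM, assuming an x86 compiler that ships <emmintrin.h>. The helper name add8_words is made up for this example. It adds eight 16-bit values with a single paddw:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds eight 16-bit values pairwise in one step. */
static void add8_words(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);  /* unaligned 16-byte load */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi16(va, vb);                /* paddw: 8 additions at once */
    _mm_storeu_si128((__m128i *)out, sum);              /* unaligned 16-byte store */
}
```

With only one or two values to process, the loads and stores around the paddw are the kind of interface overhead mentioned above, and plain integer instructions will usually be the simpler choice.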

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 19, 2009, 10:22:15 PM

Quote from: mitchi on April 19, 2009, 06:00:55 PM
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.

Speed might be an argument, even for strings :wink

Code:

Cycles:
11812   InString, 1, addr Mainstr, addr TestSubX
1966    InstrSSE2, 1, addr Mainstr, addr TestSubD, 0

Another example (http://www.masm32.com/board/index.php?topic=11191.msg83115#msg83115)

Title: Re: Who knows and use SSE ?
Post by: mitchi on April 19, 2009, 11:09:15 PM

Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?

Title: Re: Who knows and use SSE ?
Post by: NightWare on April 20, 2009, 02:45:50 AM

it's not very difficult when you have good documentation. however, if you want to learn simd usage, you must define your needs first (because there are too many instructions, so you should select the appropriate set for your needs). MMX is essentially for gfx/color manipulation, SSE is essentially for 3D stuff, SSE2 is for both, and SSE3+ is not a big improvement.

Title: Re: Who knows and use SSE ?
Post by: Alloy on April 22, 2009, 03:25:19 AM

SSE can also be used to store and retrieve data to CPU registers instead of memory. And I use it to handle integers larger than 32 bit.
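
A rough sketch of the "integers larger than 32 bit" use, written in C with SSE2 intrinsics instead of MASM and assuming <emmintrin.h> is available. The function name add_u64_sse2 is invented for illustration. It adds two 64-bit integers in an xmm register with paddq, so no add/adc pair is needed in 32-bit code:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 64-bit addition done entirely in xmm registers. */
static uint64_t add_u64_sse2(uint64_t a, uint64_t b)
{
    __m128i va  = _mm_loadl_epi64((const __m128i *)&a);  /* movq: load low 64 bits */
    __m128i vb  = _mm_loadl_epi64((const __m128i *)&b);
    __m128i sum = _mm_add_epi64(va, vb);                 /* paddq: 64-bit add */
    uint64_t result;
    _mm_storel_epi64((__m128i *)&result, sum);           /* movq: store low 64 bits */
    return result;
}
```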

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 22, 2009, 06:08:12 AM

Quote from: mitchi on April 19, 2009, 11:09:15 PM
Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?

Not harder than normal assembly; however, there is a confusingly large choice of instructions, many of them doing almost exactly the same thing. In practice, you can get along with only a few. Check my example from here (http://www.masm32.com/board/index.php?action=dlattach;topic=3601.0;id=6102) - only a handful...

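16-byte alignment is a big issue: movaps raises an exception if its memory operand is not 16-byte aligned. movups is a drop-in replacement that accepts any address, but it is a bit slower.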
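
As a small illustration of the alignment issue, here is a sketch in C with SSE2 intrinsics (not MASM), assuming <emmintrin.h>. The helper names is_aligned16 and load16 are made up for this example. The check is the same one as "test esi, 15" in the listing below:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 1 if p sits on a 16-byte boundary, else 0. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: loads 16 bytes, using the aligned form only when safe. */
static __m128i load16(const unsigned char *p)
{
    if (is_aligned16(p))
        return _mm_load_si128((const __m128i *)p);  /* aligned load: faults if p is unaligned */
    return _mm_loadu_si128((const __m128i *)p);     /* unaligned load: works for any address */
}
```

In a real inner loop the branch would cost more than it saves, so the usual approach is the one taken in the routine below: handle the unaligned head once, then stay on aligned addresses.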

Code:


option	prologue:none	; no stack frame
option	epilogue:none					
align 16
InstrJJ proc StartPos:DWORD, lpSource:DWORD, lpPattern:DWORD, sMode:DWORD
	push esi
	push edi
	push ebx		; all registers preserved, except eax = return value
	push ebp
	push ecx
	push edx
	mov esi, [esp+6*4+2*4]		; lpSource
	mov edx, [esp+6*4+3*4]		; lpPattern
	movzx eax, word ptr [edx]	; 3 cycles to fill xmm3 with first word
	test ah, ah
	je ByteScan
	imul eax, 00010001h	; propagate loword
	movd xmm3, eax
	pshufd xmm3, xmm3, 0		; xmm3 holds first word of pattern
	mov edi, [edx+2]		; next 4 bytes of pattern
	mov eax, edi
	or ebx, -1
	.if al==0
		xor ebx, ebx		; byte 3 is zero
		mov edi, ebx
	.elseif ah==0
		movzx ebx, bl		; byte 4 is zero
		and edi, ebx
	.else
		shr eax, 16
		.if al==0
			movzx ebx, bx	; byte 5 is zero (= and ebx, 0FFFFh)
		.elseif ah==0
			and ebx, 0ffffffh	; byte 6 is zero 
		.endif
	.endif
	and edi, ebx			; apply mask for bytes 2-5
	test esi, 15			; aligned?
	je L0				; if aligned, clear ebp and go directly into the main loop
	movups xmm1, [esi]		; load 16 bytes from current unaligned address
	movups xmm4, [esi+1]		; load another 16 bytes
	mov ebp, esi			; save unaligned address
	and esi, -16			; align esi downwards
	jmp @F
L0:	xor ebp, ebp
L1:	movaps xmm1, [esi]		; load 16 bytes from current aligned address
	movups xmm4, [esi+1]		; load another 16 bytes

@@:	movaps xmm2, xmm1		; save 16 bytes for testing the zero delimiter
	lea esi, [esi+16]		; len counter (moving up/down lea or add costs cycles)
	pcmpeqw xmm1, xmm3		; compare packed words in xmm1 and xmm3 for equality
	pcmpeqb xmm2, xmm1		; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
	pcmpeqw xmm4, xmm3		; compare packed words in xmm4 and xmm3 for equality
	pmovmskb edx, xmm1		; set byte mask in edx for search pattern word
	pmovmskb eax, xmm2		; set byte mask in eax for zero delimiter byte
	pmovmskb ecx, xmm4		; set byte mask in ecx for search pattern word at esi+1
	shl ecx, 1			; adjust for esi+1 (add ecx, ecx is a lot slower)
	test eax, eax			; zero byte found?
	jnz @F				; check ebp, then ChkNull
	or edx, ecx		; one of them needs to have the word
	jz L1			; 0=no pattern byte found, go back

@@:	test ebp, ebp		; 0=never unaligned, or second loop
	je @F			; ebp=16*n+1....15 ->esi=16*n+16, i.e. esi>ebp
	add ebp, 16
	.if ebp<esi			; at least second loop
		xor ebp, ebp
	.endif
	and ebp, 15

@@:	test eax, eax
	jnz ChkNull

@@:	bsf ecx, edx	; bit scan for the index --------------------------
	lea eax, [esi+ecx-15]
	mov eax, [eax+ebp+1]	; first unaligned chunk contains match
	btr edx, ecx			; clear bit ecx in edx
	and eax, ebx
	cmp eax, edi
	je FoundPattern
BadLuck:
	xor ebp, ebp
	test edx, edx
	jnz @B	; bit scan end ------------------------------------------
	jmp L1	; 0=no more hits in these 16 bytes, go back searching (reversing order is somewhat slower)

ChkNull:
	mov ebx, eax				; position of zero byte
	xor eax, eax				; default: 0=no match
	or edx, ecx					; one of them needs to have the word
	je NoMatch
	bsf ebx, ebx				; nullbyte index in ebx
	bsf ecx, edx				; pattern word index in ecx
	cmp ebx, ecx				; null before pattern word: outta here
	jb NoMatch
	cmp [esi+ecx-14], edi
	jne NoMatch				; first dword after first word doesn't match, so get out

FoundPattern:					; we need to check the complete string here
	test edi, edi				; one-word pattern?
	je Match
	push edi
	push esi
	mov edi, [esp+6*4+3*4+8]	; lpPattern
	lea esi, [esi+ecx-16]
	add esi, ebp

@@:	inc edi
	inc esi
	movzx eax, byte ptr [edi]
	test eax, eax
	je @F
	cmp al, byte ptr [esi]
	je @B

@@:	pop esi
	pop edi
	test eax, eax
	jne BadLuck
Match:
	sub esi, [esp+6*4+2*4]	; lpSource: subtract original src pointer
	lea eax, [esi+ecx-15]		; and adjust for the index
	add eax, ebp			; ebp = offset in first unaligned chunk

NoMatch:
	pop edx		; 6 registers
	pop ecx
	pop ebp
	pop ebx
	pop edi
	pop esi
	ret 4*4		; 4 arguments

ByteScan:
	imul eax, 01010101h			; propagate lobyte
	movd xmm3, eax
	pshufd xmm3, xmm3, 0		; xmm3 holds first byte of pattern in all 16 positions
@@:	movups xmm1, [esi]			; load 16 bytes from current (possibly unaligned) address
	movaps xmm2, xmm1			; save 16 bytes for testing the zero delimiter
	lea esi, [esi+16]				; len counter (moving up/down lea or add costs cycles)
	pcmpeqb xmm1, xmm3		; compare packed bytes in xmm1 and xmm3 (elephant) for equality
	pcmpeqb xmm2, xmm1		; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
	pmovmskb edx, xmm1		; set byte mask in edx for search pattern byte
	pmovmskb ecx, xmm2		; set byte mask in ecx for zero delimiter
	test ecx, ecx					; zero byte found?
	jnz @F							; yes: leave the loop and compare the indexes
	test edx, edx					; pattern found?
	jz @B							; 0=no pattern byte found, go back

@@:	xor eax, eax
	bsf ecx, ecx
	bsf edx, edx
	cmp ecx, edx
	ja NoMatch
	lea eax, [esi+edx-15]
	sub eax, [esp+6*4+2*4]
	jmp NoMatch

InstrJJ endp
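
The core step of the ByteScan path above (pcmpeqb, pmovmskb, bsf) can be sketched in C with SSE2 intrinsics. This is not a translation of InstrJJ: it handles a single 16-byte block, uses a plain compare against zero for the delimiter instead of the mask trick, and the name scan_block16 is made up. It assumes <emmintrin.h> and the GCC/Clang builtin __builtin_ctz as a stand-in for bsf:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: scans one 16-byte block for byte c.
   Returns the index of the first hit, or -1 if c is absent
   or a zero terminator comes before it. */
static int scan_block16(const unsigned char *block, unsigned char c)
{
    __m128i data    = _mm_loadu_si128((const __m128i *)block);  /* unaligned load */
    __m128i pattern = _mm_set1_epi8((char)c);                   /* c in all 16 bytes */
    __m128i zero    = _mm_setzero_si128();

    /* pcmpeqb + pmovmskb: one mask bit per byte position */
    int hit_mask  = _mm_movemask_epi8(_mm_cmpeq_epi8(data, pattern));
    int zero_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, zero));

    if (hit_mask == 0)
        return -1;                                   /* no pattern byte in this block */
    int hit = __builtin_ctz((unsigned)hit_mask);     /* lowest set bit, like bsf */
    if (zero_mask != 0 && __builtin_ctz((unsigned)zero_mask) < hit)
        return -1;                                   /* string ended before the hit */
    return hit;
}
```

A full search would call this in a loop over aligned blocks and then verify the rest of the pattern at each hit, which is what the FoundPattern section of the listing does.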

Title: Re: Who knows and use SSE ?
Post by: Mark Jones on April 22, 2009, 04:52:08 PM

So who is going to lavish us with a series of SSE tutorials? :bg

Title: Re: Who knows and use SSE ?
Post by: lingo on April 22, 2009, 05:27:12 PM

It is so complicated with bad programming style, ugly and as results
works with strings ONLY and it is so slowwwww....shame, shame... :'(
just take a look:

Code:


Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
FJT2 Cresta/IanB,  byte-length shifts: 4294967295 ; clocks: 248
FJT3 Cresta/IanB,  word-length shifts: 4294967295 ; clocks: 202
FJT4 Cresta/IanB, dword-length shifts: 4294967295 ; clocks: 258
Boyer-Moore Lingo, byte-length shifts: 37 ; clocks: 94
Boyer-Moore Lingo, word-length shifts: 37 ; clocks: 120
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 143
InString - JJ:                         38 ; clocks: 98
InString - Lingo:                      37 ; clocks: 42


Search Test 2 - value expected 1007; lenSrchPattern ->17
FJT2 Cresta/IanB,  byte-length shifts: 1007 ; clocks: 9808
FJT3 Cresta/IanB,  word-length shifts: 1007 ; clocks: 9857
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 9820
Boyer-Moore Lingo, byte-length shifts: 1007 ; clocks: 7786
Boyer-Moore Lingo, word-length shifts: 1007 ; clocks: 7832
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 7793
InString - JJ:                         1008 ; clocks: 22619
InString - Lingo:                      1007 ; clocks: 8610

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 646
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 597
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 662
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 479
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 497
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 528
InString - JJ:                         1009 ; clocks: 715
InString - Lingo:                      1008 ; clocks: 513

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2314
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1461
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2334
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1253
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1279
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1310
InString - JJ:                         1009 ; clocks: 6539
InString - Lingo:                      1008 ; clocks: 4453

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2477
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1681
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2493
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1097
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1113
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1145
InString - JJ:                         1009 ; clocks: 5428
InString - Lingo:                      1008 ; clocks: 4145

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 760
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 714
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 777
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 580
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 606
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 642
InString - JJ:                         1009 ; clocks: 628
InString - Lingo:                      1008 ; clocks: 513

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
FJT2 Cresta/IanB,  byte-length shifts: 1009 ; clocks: 951
FJT3 Cresta/IanB,  word-length shifts: 1009 ; clocks: 905
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 968
Boyer-Moore Lingo, byte-length shifts: 1009 ; clocks: 767
Boyer-Moore Lingo, word-length shifts: 1009 ; clocks: 792
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 830
InString - JJ:                         1010 ; clocks: 624
InString - Lingo:                      1009 ; clocks: 513

 Press ENTER to exit...

Call the moderators to help you.... :lol

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 22, 2009, 06:15:37 PM

Quote from: lingo on April 22, 2009, 05:27:12 PM
It is so complicated with bad programming style, ugly

Yeah, that's a known problem. When do you finally learn to comment your code??

Quote
and as results works with strings ONLY and it is so slowwwww....shame, shame... :'(

Well, at least my code works just fine with strings, instead of throwing exceptions like yours (http://www.masm32.com/board/index.php?topic=3601.msg83512#msg83512) if no match is found. Furthermore, it works with any pattern length (yours needs 8 bytes minimum, right?), and for normal, i.e. non exotic cases, it is a factor 7-8 faster than the Masm32 library InString. I am a modest person, a factor 7 faster is enough for me :bg

Title: Re: Who knows and use SSE ?
Post by: lingo on April 22, 2009, 07:02:41 PM

"instead of throwing exceptions like yours if no match is found."

Due to the numbers of the result do you want to abuse me? :naughty:
Call the moderators for me because I'm not guilty
that you are impotent to use the code properly :lol
Slowwww... shame..shame :lol

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 22, 2009, 07:43:56 PM

Quote from: lingo on April 22, 2009, 07:02:41 PM
"instead of throwing exceptions like yours if no match is found."

Due to the numbers of the result do you want to abuse me? :naughty:
Call the moderators for me because I'm not guilty
that you are impotent to use the code properly :lol
Slowwww... shame..shame :lol

Hey, my angry young friend, RTFM: The title of the thread you are referring to is "String searching"; from the Masm32 library help file: "InString searches for a substring in a larger string". That is what most of the algos in that thread do successfully. Except yours, which crashes on the rather simple task of (not) finding "duplicate inx" at the end of windows.inc ... ::)

On the positive side, I see that nowadays you have cautiously started to comment your code:

Code:

BMLinDD proc
...
	movd  mm5, esp	; save esp register
...

Congratulations, Lingo :U Although I have a suspicion that some of the seasoned old hands here might complain that you state the obvious, it is a step in the right direction! :clap:

Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg

Title: Re: Who knows and use SSE ?
Post by: NightWare on April 22, 2009, 11:40:45 PM

Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg

hi, the usage of sse/sse2 (single/double precision) is quite limited if you don't do 3D stuff; most of the sse/sse2 hints you can see in the algos posted here are just a deviation from the normal use of those instructions.

and if you do 3D stuff it's essentially math tutorials that are needed, because matrix*matrix, matrix*vector, conditional selection of a vector, transposing a matrix, etc. are the essentials, and the instructions to use are obvious in this case.
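
As an example of the matrix*vector case, here is a minimal sketch in C with SSE intrinsics, assuming <xmmintrin.h> and a column-major 4x4 matrix. The layout and the name mat4_mul_vec4 are assumptions made for this illustration. Each column is scaled by one vector component and the four products are summed, which maps onto mulps and addps:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out = M * v, with M stored column-major
   (m[0..3] is column 0, m[4..7] is column 1, and so on). */
static void mat4_mul_vec4(const float *m, const float *v, float *out)
{
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m), _mm_set1_ps(v[0]));              /* col0 * x */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(v[1]))); /* + col1 * y */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(v[2]))); /* + col2 * z */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3]))); /* + col3 * w */
    _mm_storeu_ps(out, r);                                                  /* unaligned store */
}
```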

Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame :lol

Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg

hmm..., look like the beginning of a wonderfull love story, i'm just worry... who have planned to meet the taylor for the white dress ? :eek

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 23, 2009, 01:32:30 AM

Quote from: NightWare on April 22, 2009, 11:40:45 PM
most of the sse/sse2 hints you can see in the algos posted here are just a deviation from the normal use of those instructions.

Correct - this kind of algo is not what SSE2 was originally meant for. But it works :bg
Maybe you can solve a mystery for me; I use movaps+movups for moving integers around in my inner loop:

Code:

L1: movaps xmm1, [esi]  ; load 16 bytes from current aligned address
movups xmm4, [esi+1] ; load another 16 bytes

@@: movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16]
...

Tests with the "official" movdqa+movdqu are roughly 2% slower. Intel says (http://software.intel.com/en-us/articles/memcpy-performance/):
"SSE2 movdqu/movdqa instructions were introduced specifically for this purpose. movdqa is suitable for 16-byte aligned operands. movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them.

The Barcelona architecture prefers movaps for stores. movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding."

I saw mentioned that they use different units (http://objectmix.com/asm-x86-asm-370/377453-population-count-sse2-again-3.html), which might explain the speed difference. But is there any reason not to use the fastest variant??
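
The "functionally equivalent" claim can be checked with a short sketch in C with SSE2 intrinsics, assuming <emmintrin.h>. The function name loads_match is invented here. It loads the same 16 bytes once through the float-typed load and once through the integer-typed load and compares the register contents bit for bit. This says nothing about speed or execution units, and a compiler is free to choose which move instruction it actually emits:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */

/* Hypothetical helper: 1 if a float-typed load and an integer-typed load
   of the same 16 bytes give identical register contents, else 0. */
static int loads_match(const float *p)
{
    __m128  as_float = _mm_loadu_ps(p);                       /* float-typed load */
    __m128i as_int   = _mm_loadu_si128((const __m128i *)p);   /* integer-typed load */
    __m128i float_bits = _mm_castps_si128(as_float);          /* reinterpret only, no conversion */
    __m128i equal = _mm_cmpeq_epi8(float_bits, as_int);       /* FF where bytes are equal */
    return _mm_movemask_epi8(equal) == 0xFFFF;                /* all 16 bytes equal? */
}
```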

Quote
Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame :lol
Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg
hmm..., look like the beginning of a wonderfull love story, i'm just worry... who have planned to meet the taylor for the white dress ? :eek

No such plans. She has not even sent her photo :tdown

Title: Re: Who knows and use SSE ?
Post by: NightWare on April 23, 2009, 01:57:36 AM

Quote from: jj2007 on April 23, 2009, 01:32:30 AM
I saw mentioned that they use different units, which might explain the speed difference. But is there any reason not to use the fastest variant??

the port used by the instruction can speed things up or slow them down; it depends on the other instructions of the algo. now concerning the fastest, movdqu/movdqa makes sense: see the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2... :wink

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 23, 2009, 02:10:08 AM

Quote from: NightWare on April 23, 2009, 01:57:36 AM
now concerning the fastest, movdqu/movdqa makes sense: see the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2... :wink

But movdqa is actually slower than movaps... ::)

Title: Re: Who knows and use SSE ?
Post by: lingo on April 23, 2009, 05:53:52 AM

"But movdqa is actually slower than movaps."

again nonsense...and two tests more:

Code:


Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 8 - value expected 22646 ;lenSrchPattern ->7260(1C5Ch)
Boyer-Moore Lingo, word-length shifts: 22646 ; clocks: 24483
Boyer-Moore Lingo,dword-length shifts: 22646 ; clocks: 27209
InString - JJ:                         22647 ; clocks: 36692
InString - Lingo:                      22646 ; clocks: 19656


Search Test 9 -Find 'Duplicate inc' in 'windows.inc' ; lenSrchPattern ->13
Boyer-Moore Lingo, word-length shifts: 1127624 ; clocks: 898528
Boyer-Moore Lingo,dword-length shifts: 1127624 ; clocks: 898721
InString - JJ:                         1127625 ; clocks: 680112
InString - Lingo:                      1127624 ; clocks: 561030

 Press ENTER to exit...

Slowwww again and ...shame, shame... :lol

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 23, 2009, 02:36:00 PM

Quote from: lingo on April 23, 2009, 05:53:52 AM
"But movdqa is actually slower than movaps."
again nonsense...

As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. That is what I did, and on a Celeron M aps+ups are slightly faster. On a Prescott P4, they seem to be roughly equivalent; both statements refer to the inner loop of the InstringJJ posted above.

What is really odd, though, is the performance of a register-to-register move (e.g. movaps xmm1, xmm2) on a P4.
Here are the timings for a Celeron M:

Code:

Aligned, mem to xmm:
7       cycles for 4* movaps
7       cycles for 4* movapd
7       cycles for 4* movdqa

Unaligned, mem to xmm:
13      cycles for 4* movups
13      cycles for 4* movupd
13      cycles for 4* movdqu

Aligned, xmm to xmm:
4       cycles for 4* movaps
4       cycles for 4* movapd
4       cycles for 4* movdqa

Aligned, xmm to MEM to xmm:
15      cycles for 4* movaps
15      cycles for 4* movapd
15      cycles for 4* movdqa

And here the P4:

Code:

Aligned, mem to xmm:
4       cycles for 4* movaps
3       cycles for 4* movapd
3       cycles for 4* movdqa

Unaligned, mem to xmm:
25      cycles for 4* movups
26      cycles for 4* movupd
25      cycles for 4* movdqu

Aligned, xmm to xmm:
27      cycles for 4* movaps         <---------------------------------
27      cycles for 4* movapd
27      cycles for 4* movdqa

Aligned, xmm to MEM to xmm:
17      cycles for 4* movaps         <---------------------------------
17      cycles for 4* movapd
17      cycles for 4* movdqa

Surprisingly,
   movdqa [esi], xmm0
   movdqa xmm1, [esi]
is faster than a simple
   movdqa xmm1, xmm0

For the aficionados, I attach the testbed. I could not reproduce that speed gain in a real life algo (the inner loop of the InstringJJ posted above).

[attachment deleted by admin]

Title: Re: Who knows and use SSE ?
Post by: lingo on April 23, 2009, 04:39:37 PM

"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "

I did it in some years ago and my results are similar to link.. (http://flashlight.slad.cz/files/memtest_p4_northwood.txt)
The opposite "your" information is a steal from generic optimization of memcpy() here..] (http://74.125.95.132/search?q=cache:LUF0KefzXzEJ:software.intel.com/en-us/articles/memcpy-performance/+movdqa+movaps&cd=10&hl=en&ct=clnk&c)

Hence, "I could not reproduce that speed gain in a real life algo" is a proof that
"But movdqa is actually slower than movaps." is a nonsense
Hence, nonsense here is just the true rather than an 'emotional abuse'... :lol

Title: Re: Who knows and use SSE ?
Post by: d0d0 on April 23, 2009, 06:34:10 PM

Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg

Please, please team up and give us some tutorials :bg

Imagine all that energy/brain power combined :dazzled:

Title: Re: Who knows and use SSE ?
Post by: Rainstorm on April 23, 2009, 06:55:45 PM

mark jones wrote. . .

Quote
So who is going to lavish us with a series of SSE tutorials? :bg

yes.. that would be good.

There are some links to reference material & a tute on SSE somewhere on the forum I think

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 23, 2009, 08:54:59 PM

Quote from: lingo on April 23, 2009, 04:39:37 PM
"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "

I did it in some years ago and my results are similar to link.. (http://flashlight.slad.cz/files/memtest_p4_northwood.txt)
The opposite "your" information is a steal from generic optimization of memcpy() here..] (http://74.125.95.132/search?q=cache:LUF0KefzXzEJ:software.intel.com/en-us/articles/memcpy-performance/+movdqa+movaps&cd=10&hl=en&ct=clnk&c)

Hence, "I could not reproduce that speed gain in a real life algo" is a proof that
"But movdqa is actually slower than movaps." is a nonsense
Hence, nonsense here is just the true rather than an 'emotional abuse'... :lol

First, you seem to have a serious problem with the concepts of "stealing" and "quoting".
Second, when you post a link, read it at least carefully before crying nonsense:

Quote
Memory copy routines tester by Petr Supina, 2005-2006
Block size: 2 x 16 Bytes
Method    Time [ns]
------    ---------
movaps:    6.71704
movdqa:    6.75457

Block size: 2 x 256 Bytes
Method    Time [ns]
movaps:    38.7073
movdqa:    38.7663

That pattern changes for larger block sizes, but your link proves that my observation was correct.
Thank you :U

And just for fun, here the results for my Celeron M, obtained with the Supina software via your link:

Code:

Block size: 2 x 16 Bytes
Method          Time [ns]
------          ---------
movaps:         11.0079
movdqa:         11.1622

Block size: 2 x 256 Bytes
Method          Time [ns]
movaps:         37.8871
movdqa:         39.8011

Block size: 2 x 4096 Bytes
Method          Time [us]
------          ---------
movaps:         0.552621
movdqa:         0.556335

Good night Lingo :8)

Title: Re: Who knows and use SSE ?
Post by: NightWare on April 23, 2009, 09:30:52 PM

Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)

no, it CAN be slower, but because of other factors, like a different alignment of the following instructions (because code alignment counts); otherwise the extra byte read is quickly absorbed by the loop once the algo is in the cache. it can also depend on the number of uops for the port, AND the normal number of uops handled by your cpu.

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 24, 2009, 04:20:49 AM

Quote from: NightWare on April 23, 2009, 09:30:52 PM
Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)
no, it CAN be slower, but because of other factors, like a different alignment of the following instructions (because code alignment counts); otherwise the extra byte read is quickly absorbed by the loop once the algo is in the cache. it can also depend on the number of uops for the port, AND the normal number of uops handled by your cpu.

Agreed, although code alignment plays a surprisingly small role on modern CPU's. I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4.
Another problem is the consistency of timings. I ran another test with this software (http://flashlight.slad.cz/files/memtest-2006-04-12.zip), see attachment - outliers everywhere, and the supplied 6-digits precision is clearly misleading.

[attachment deleted by admin]

Title: Re: Who knows and use SSE ?
Post by: Draakie on April 24, 2009, 11:26:14 AM

http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

Title: Re: Who knows and use SSE ?
Post by: lingo on April 24, 2009, 01:30:24 PM

"although code alignment plays a surprisingly small role on modern CPU's."

nonsense again...
- If I'm not wrong you have no experience with modern CPUs because your CPUs are still archaic

" I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4."

You are the champion in slow code, hence your code is the slowest before to start your test
It is the reason that after the test it can't be slower... :lol

I know that you can't control your emotions but you can try your test with my code... :lol

Title: Re: Who knows and use SSE ?
Post by: d0d0 on April 24, 2009, 02:00:18 PM

Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

Thanks mate

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 24, 2009, 03:47:03 PM

Quote from: lingo on April 24, 2009, 01:30:24 PM

I know that you can't control your emotions but you can try your test with my code... :lol

My PC hates exceptions :bg

Title: Re: Who knows and use SSE ?
Post by: lingo on April 24, 2009, 04:19:33 PM

OK, no offend but from your code I see that you are still a mad code pilferer and newbie spaghetti code creator without any ideas and experience in programming
Due to your age and lack of interest to learn new things from other's experience ( A.Fog, etc.) you will stay mad newbie (with level and interests like this (http://www.masm32.com/board/index.php?topic=7653.0) or this (http://www.masm32.com/board/index.php?topic=10780.msg78979#msg78979) or this (http://www.masm32.com/board/index.php?topic=9383.0) or this (http://www.masm32.com/board/index.php?topic=9782.0) etc.) until end of your life. It is the reason that I loose interest in and don't want to loose my time for people like you.. :tdown Sorry mad leaky watering-pot and don't forget to get your medicine now! :lol

Title: Re: Who knows and use SSE ?
Post by: jj2007 on April 25, 2009, 08:04:03 AM

Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0

Needs an update on more sites and report on dead links.

Draakie

Thanks, Draakie. The neilkemp and dennishome links are more or less dead, while http://www.tommesani.com/Docs.html is still one of the better sources. Jorgon (http://www.jorgon.freeserve.co.uk/TestbugHelp/XMMintins.htm) has a good intro, too.

An excellent complete reference is here (http://www.ews.uiuc.edu/~cjiang/reference/index.htm).

What I really miss is an in-depth discussion of what exactly are the rules for using/mixing the float and integer instructions. Some sources say that movaps, movapd and movdqa are "functionally equivalent"... so why do we need them all?

Title: Re: Who knows and use SSE ?
Post by: Rainstorm on April 25, 2009, 08:20:01 AM

.. an alt to the dead link

http://www.neilkemp.us/v4/articles/sse_tutorial/sse_tutorial.html

and another..

http://softpixel.com/~cwright/programming/simd/mmx.php

Title: Re: Who knows and use SSE ?
Post by: mitchi on April 25, 2009, 02:50:52 PM

Quote from: Rainstorm on April 25, 2009, 08:20:01 AM
.. an alt to the dead link

http://www.neilkemp.us/v4/articles/sse_tutorial/sse_tutorial.html

and another..

http://softpixel.com/~cwright/programming/simd/mmx.php

Thank you very much !

Title: Re: Who knows and use SSE ?
Post by: Alloy on April 26, 2009, 02:09:28 PM

http://www.masm32.com/board/index.php?topic=782.0

It might be a dead link too, and it only ran on 32-bit Windows.

The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: mitchi on April 19, 2009, 06:00:55 PM