Who knows and uses SSE?

Started by mitchi, April 19, 2009, 06:00:55 PM

mitchi

It seems to me that very few people here know and use SSE in their programs. I could be wrong of course.
As for me, I don't know how to use the FPU, SSE (any version) or MMX. Am I missing something? Should I learn how to use them?
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.

TASMUser

So do I.

SSE/MMX only makes sense if you have to calculate or handle more than one 8/16/32-bit value at the same time.
In general I get more efficient results with standard ASM instructions, even in cases where I consider using MMX/SSE instructions.
Once you decide on MMX/SSE, you have to gear your whole program/routine toward this instruction set, and you get more interface overhead.
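
To make the "more than one value at the same time" case concrete, here is a minimal sketch in C with SSE2 intrinsics (C rather than MASM, so it can be compiled and checked easily). The function name add8_words is made up for this illustration; the intrinsic _mm_add_epi16 corresponds to the paddw instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for eight 16-bit values.
   All eight additions happen in a single paddw; results wrap on overflow. */
void add8_words(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);  /* unaligned 16-byte load */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi16(va, vb);                /* paddw: 8 adds at once */
    _mm_storeu_si128((__m128i *)out, sum);              /* unaligned 16-byte store */
}
```

With only one or two values to add, the loads and stores around the paddw are exactly the kind of interface overhead mentioned above, so the scalar version may well win.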

jj2007

Quote from: mitchi on April 19, 2009, 06:00:55 PM
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.

Speed might be an argument, even for strings :wink

Cycles:
11812   InString, 1, addr Mainstr, addr TestSubX
1966    InstrSSE2, 1, addr Mainstr, addr TestSubD, 0


Another example

mitchi

Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?

NightWare

it's not very difficult when you have good documentation. however, if you want to learn simd usage, you must define your needs first (coz there are too many instructions, so you should select the appropriate set for your needs). MMX is essentially for gfx/color manipulation, SSE is essentially for 3D stuff, SSE2 is for both, and SSE3+ is not a big improvement

Alloy

SSE can also be used to store and retrieve data to CPU registers instead of memory. And I use it to handle integers larger than 32 bit.
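
A small sketch of the "integers larger than 32 bit" idea, written in C with SSE2 intrinsics. The function name add_u64_sse2 is an assumption for this example; _mm_loadl_epi64 / _mm_storel_epi64 correspond to movq and _mm_add_epi64 to paddq, which lets 32-bit code add 64-bit integers without an add/adc pair.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 64-bit addition done in an XMM register. */
uint64_t add_u64_sse2(uint64_t a, uint64_t b)
{
    __m128i va  = _mm_loadl_epi64((const __m128i *)&a);  /* movq xmm, m64 */
    __m128i vb  = _mm_loadl_epi64((const __m128i *)&b);
    __m128i sum = _mm_add_epi64(va, vb);                 /* paddq: carry crosses bit 31 */
    uint64_t result;
    _mm_storel_epi64((__m128i *)&result, sum);           /* movq m64, xmm */
    return result;
}
```

For example, add_u64_sse2(0xFFFFFFFF, 1) yields 0x100000000, i.e. the carry out of the low dword is handled by the instruction itself.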
We all used to be something else. Nature has always recycled.

jj2007

Quote from: mitchi on April 19, 2009, 11:09:15 PM
Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?

Not harder than normal assembly; however, there is a confusingly large choice of instructions, many of them doing almost exactly the same thing. In practice, you can get along with only a few. Check my example from here - only a handful...

16-byte alignment is a big issue; movups is a replacement for movaps, but a bit slower.
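
The alignment point can be sketched in C with SSE2 intrinsics as follows. The helper names is_aligned16 and load16 are invented for this illustration, and the intrinsics shown map to the integer forms movdqa/movdqu rather than the movaps/movups used in the routine below; the alignment rules are the same.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 1 if p sits on a 16-byte boundary
   (the same check as "test esi, 15" in assembly). */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: load 16 bytes, using the aligned form only when legal. */
__m128i load16(const void *p)
{
    if (is_aligned16(p))
        return _mm_load_si128((const __m128i *)p);   /* aligned load: faults if misaligned */
    return _mm_loadu_si128((const __m128i *)p);      /* unaligned load: any address */
}
```

In a real inner loop you would not branch on every load; the usual approach, as in the routine below, is to handle the unaligned head once and then stay on aligned addresses.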



option prologue:none ; no stack frame
option epilogue:none
align 16
InstrJJ proc StartPos:DWORD, lpSource:DWORD, lpPattern:DWORD, sMode:DWORD
push esi
push edi
push ebx ; all registers preserved, except eax = return value
push ebp
push ecx
push edx
mov esi, [esp+6*4+2*4] ; lpSource
mov edx, [esp+6*4+3*4] ; lpPattern
movzx eax, word ptr [edx] ; 3 cycles to fill xmm3 with first word
test ah, ah
je ByteScan
imul eax, 00010001h ; propagate loword
movd xmm3, eax
pshufd xmm3, xmm3, 0 ; xmm3 holds first word of pattern
mov edi, [edx+2] ; next 4 bytes of pattern
mov eax, edi
or ebx, -1
.if al==0
xor ebx, ebx ; byte 3 is zero
mov edi, ebx
.elseif ah==0
movzx ebx, bl ; byte 4 is zero
and edi, ebx
.else
shr eax, 16
.if al==0
movzx ebx, bx ; byte 5 is zero (= and ebx, 0FFFFh)
.elseif ah==0
and ebx, 0ffffffh ; byte 6 is zero
.endif
.endif
and edi, ebx ; apply mask for bytes 2-5
test esi, 15 ; aligned?
je L0 ; if aligned, clear ebp and go directly into the main loop
movups xmm1, [esi] ; load 16 bytes from current unaligned address
movups xmm4, [esi+1] ; load another 16 bytes
mov ebp, esi ; save unaligned address
and esi, -16 ; align esi downwards
jmp @F
L0: xor ebp, ebp
L1: movaps xmm1, [esi] ; load 16 bytes from current aligned address
movups xmm4, [esi+1] ; load another 16 bytes

@@: movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16] ; len counter (moving up/down lea or add costs cycles)
pcmpeqw xmm1, xmm3 ; compare packed words in xmm1 and xmm3 for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pcmpeqw xmm4, xmm3 ; compare packed words in xmm4 and xmm3 for equality
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern word
pmovmskb eax, xmm2 ; set byte mask in eax for zero delimiter byte
pmovmskb ecx, xmm4 ; set byte mask in ecx for search pattern word (at esi+1)
shl ecx, 1 ; adjust for esi+1 (add ecx, ecx is a lot slower)
test eax, eax ; zero byte found?
jnz @F ; check ebp, then ChkNull
or edx, ecx ; one of them needs to have the word
jz L1 ; 0=no pattern byte found, go back

@@: test ebp, ebp ; 0=never unaligned, or second loop
je @F ; ebp=16*n+1....15 ->esi=16*n+16, i.e. esi>ebp
add ebp, 16
.if ebp<esi ; at least second loop
xor ebp, ebp
.endif
and ebp, 15

@@: test eax, eax
jnz ChkNull

@@: bsf ecx, edx ; bit scan for the index --------------------------
lea eax, [esi+ecx-15]
mov eax, [eax+ebp+1] ; first unaligned chunk contains match
btr edx, ecx ; clear bit ecx in edx
and eax, ebx
cmp eax, edi
je FoundPattern
BadLuck:
xor ebp, ebp
test edx, edx
jnz @B ; bit scan end ------------------------------------------
jmp L1 ; 0=no more hits in these 16 bytes, go back searching (reversing order is somewhat slower)

ChkNull:
mov ebx, eax ; position of zero byte
xor eax, eax ; default: 0=no match
or edx, ecx ; one of them needs to have the word
je NoMatch
bsf ebx, ebx ; nullbyte index in ebx
bsf ecx, edx ; pattern word index in ecx
cmp ebx, ecx ; null before pattern word: outta here
jb NoMatch
cmp [esi+ecx-14], edi
jne NoMatch ; first dword after first word doesn't match, so get out

FoundPattern: ; we need to check the complete string here
test edi, edi ; one-word pattern?
je Match
push edi
push esi
mov edi, [esp+6*4+3*4+8] ; lpPattern
lea esi, [esi+ecx-16]
add esi, ebp

@@: inc edi
inc esi
movzx eax, byte ptr [edi]
test eax, eax
je @F
cmp al, byte ptr [esi]
je @B

@@: pop esi
pop edi
test eax, eax
jne BadLuck
Match:
sub esi, [esp+6*4+2*4] ; lpSource: subtract original src pointer
lea eax, [esi+ecx-15] ; and adjust for the index
add eax, ebp ; ebp = offset in first unaligned chunk

NoMatch:
pop edx ; 6 registers
pop ecx
pop ebp
pop ebx
pop edi
pop esi
ret 4*4 ; 4 arguments

ByteScan:
imul eax, 01010101h ; propagate lobyte
movd xmm3, eax
pshufd xmm3, xmm3, 0 ; xmm3 holds first byte of pattern in all 16 positions
@@: movups xmm1, [esi] ; load 16 bytes from current (possibly unaligned) address
movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16] ; len counter (moving up/down lea or add costs cycles)
pcmpeqb xmm1, xmm3 ; compare packed bytes in xmm1 and xmm3 (elephant) for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern byte
pmovmskb ecx, xmm2 ; set byte mask in ecx for zero delimiter
test ecx, ecx ; zero byte found?
jnz @F ; zero byte found: compare positions below
test edx, edx ; pattern found?
jz @B ; 0=no pattern byte found, go back

@@: xor eax, eax
bsf ecx, ecx
bsf edx, edx
cmp ecx, edx
ja NoMatch
lea eax, [esi+edx-15]
sub eax, [esp+6*4+2*4]
jmp NoMatch

InstrJJ endp
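
For readers who find the assembly hard to follow, the core idea of the ByteScan branch (broadcast the search byte, compare 16 bytes at once, turn the result into a bit mask, bit-scan for the first hit) might be sketched in C with SSE2 intrinsics like this. It is a simplified illustration, not a translation of the routine above: the names find_byte_sse2 and lowest_set_bit are made up, the zero test uses a plain compare against zero, and the sketch assumes the caller guarantees at least 15 readable bytes after the terminator.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Portable stand-in for bsf; mask must be non-zero. */
static int lowest_set_bit(unsigned mask)
{
    int i = 0;
    while (!(mask & 1u)) { mask >>= 1; ++i; }
    return i;
}

/* Hypothetical helper: 1-based position of the first byte c in the
   zero-terminated string s, or 0 if c does not occur before the terminator.
   ASSUMPTION: at least 15 readable bytes follow the terminator. */
size_t find_byte_sse2(const char *s, char c)
{
    const __m128i needle = _mm_set1_epi8(c);     /* c in all 16 byte positions */
    const __m128i zero   = _mm_setzero_si128();
    size_t pos = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + pos));
        /* pcmpeqb + pmovmskb: one mask bit per byte */
        unsigned hit = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        unsigned end = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (hit != 0 && (end == 0 || lowest_set_bit(hit) < lowest_set_bit(end)))
            return pos + (size_t)lowest_set_bit(hit) + 1;  /* hit before terminator */
        if (end != 0)
            return 0;                                      /* terminator came first */
        pos += 16;
    }
}
```

Like the original, this returns a 1-based position and 0 for "not found". Searching for longer patterns adds the verification step that the word-based main loop above performs after each candidate hit.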



Mark Jones

So who is going to lavish us with a series of SSE tutorials? :bg
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

lingo

It is so complicated with bad programming style, ugly and as results
works with strings ONLY and it is so  slowwwww....shame, shame... :'(
just take a look:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
FJT2 Cresta/IanB,  byte-length shifts: 4294967295 ; clocks: 248
FJT3 Cresta/IanB,  word-length shifts: 4294967295 ; clocks: 202
FJT4 Cresta/IanB, dword-length shifts: 4294967295 ; clocks: 258
Boyer-Moore Lingo, byte-length shifts: 37 ; clocks: 94
Boyer-Moore Lingo, word-length shifts: 37 ; clocks: 120
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 143
InString - JJ:                         38 ; clocks: 98
InString - Lingo:                      37 ; clocks: 42


Search Test 2 - value expected 1007; lenSrchPattern ->17
FJT2 Cresta/IanB,  byte-length shifts: 1007 ; clocks: 9808
FJT3 Cresta/IanB,  word-length shifts: 1007 ; clocks: 9857
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 9820
Boyer-Moore Lingo, byte-length shifts: 1007 ; clocks: 7786
Boyer-Moore Lingo, word-length shifts: 1007 ; clocks: 7832
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 7793
InString - JJ:                         1008 ; clocks: 22619
InString - Lingo:                      1007 ; clocks: 8610

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 646
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 597
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 662
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 479
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 497
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 528
InString - JJ:                         1009 ; clocks: 715
InString - Lingo:                      1008 ; clocks: 513

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2314
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1461
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2334
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1253
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1279
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1310
InString - JJ:                         1009 ; clocks: 6539
InString - Lingo:                      1008 ; clocks: 4453

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2477
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1681
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2493
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1097
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1113
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1145
InString - JJ:                         1009 ; clocks: 5428
InString - Lingo:                      1008 ; clocks: 4145

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 760
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 714
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 777
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 580
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 606
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 642
InString - JJ:                         1009 ; clocks: 628
InString - Lingo:                      1008 ; clocks: 513

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
FJT2 Cresta/IanB,  byte-length shifts: 1009 ; clocks: 951
FJT3 Cresta/IanB,  word-length shifts: 1009 ; clocks: 905
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 968
Boyer-Moore Lingo, byte-length shifts: 1009 ; clocks: 767
Boyer-Moore Lingo, word-length shifts: 1009 ; clocks: 792
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 830
InString - JJ:                         1010 ; clocks: 624
InString - Lingo:                      1009 ; clocks: 513

Press ENTER to exit...

Call the moderators to help you.... :lol


jj2007

Quote from: lingo on April 22, 2009, 05:27:12 PM
It is so complicated with bad programming style, ugly
Yeah, that's a known problem. When do you finally learn to comment your code??

Quote
and as results works with strings ONLY and it is so  slowwwww....shame, shame... :'(

Well, at least my code works just fine with strings, instead of throwing exceptions like yours if no match is found. Furthermore, it works with any pattern length (yours needs 8 bytes minimum, right?), and for normal, i.e. non-exotic cases, it is a factor of 7-8 faster than the Masm32 library InString. I am a modest person, a factor of 7 faster is enough for me :bg

lingo

"instead of throwing exceptions like yours if no match is found."

Due to the numbers of the result do you want to abuse me?  :naughty:
Call the moderators for me because I'm not guilty
that  you are impotent to use the code properly  :lol
Slowwww... shame..shame  :lol


jj2007

Quote from: lingo on April 22, 2009, 07:02:41 PM
"instead of throwing exceptions like yours if no match is found."

Due to the numbers of the result do you want to abuse me?  :naughty:
Call the moderators for me because I'm not guilty
that  you are impotent to use the code properly  :lol
Slowwww... shame..shame  :lol


Hey, my angry young friend, RTFM: The title of the thread you are referring to is "String searching"; from the Masm32 library help file: "InString searches for a substring in a larger string". That is what most of the algos in that thread do successfully. Except yours, which crashes on the rather simple task of (not) finding "duplicate inx" at the end of windows.inc ... ::)

On the positive side, I see that nowadays you have cautiously started to comment your code:

BMLinDD proc
...
movd  mm5, esp ; save esp register
...


Congratulations, Lingo :U Although I have a suspicion that some of the seasoned old hands here might complain that you state the obvious, it is a step in the right direction! :clap:

Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story :bdg

NightWare

Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg
hi, the usage of sse/sse2 (single/double precision) is quite limited if you don't do 3D stuff; most of the sse/sse2 hints you can see in the algos posted here are just deviations from the normal use of those instructions.

and if you do 3D stuff, it's essentially math tutorials that are needed, coz matrix*matrix, matrix*vector, conditional selection of a vector, transposing a matrix, etc... are the essentials, and the possible instructions are obvious in this case.
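
As one example of the "obvious instructions" case, a matrix*vector product might look like this in C with SSE intrinsics. This is only a sketch under stated assumptions: the name mat4_mul_vec4 is invented, and the matrix is assumed to be stored column-major (element m[4*col + row]), which is what makes the four mulps/addps steps line up.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out = M * v for a 4x4 column-major matrix M.
   Each column is scaled by one component of v, then the columns are summed. */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    __m128 c0 = _mm_loadu_ps(m + 0);    /* column 0 */
    __m128 c1 = _mm_loadu_ps(m + 4);    /* column 1 */
    __m128 c2 = _mm_loadu_ps(m + 8);    /* column 2 */
    __m128 c3 = _mm_loadu_ps(m + 12);   /* column 3 */

    __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));          /* mulps */
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));  /* mulps + addps */
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));

    _mm_storeu_ps(out, r);
}
```

With a row-major layout you would need either a transpose first or horizontal adds, which is where the math (and the choice of data layout) matters more than the instruction set.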

Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame  :lol
Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story :bdg
hmm..., looks like the beginning of a wonderful love story, i'm just worried... who has planned to meet the tailor for the white dress ?  :eek

jj2007

Quote from: NightWare on April 22, 2009, 11:40:45 PM
most of the sse/sse2 hints you can see in the algos posted here are just deviations from the normal use of those instructions.
Correct - this kind of algo is not what SSE2 was originally meant for. But it works :bg
Maybe you can solve a mystery for me; I use movaps+movups for moving integers around in my inner loop:

L1: movaps xmm1, [esi]  ; load 16 bytes from current aligned address
movups xmm4, [esi+1] ; load another 16 bytes

@@: movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16]
...

Tests with the "official" movdqa+movdqu are roughly 2% slower. Intel says:
"SSE2 movdqu/movdqa instructions were introduced specifically for this purpose.  movdqa is suitable for 16-byte aligned operands.  movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them.

The Barcelona architecture prefers movaps for stores.  movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding."

I saw mentioned that they use different units, which might explain the speed difference. But is there any reason not to use the fastest variant??
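
The "functionally equivalent" part of the quote can be checked with a small C sketch using SSE2 intrinsics. The function name loads_agree is made up for this illustration. Note the limits: it only shows that the float-typed and integer-typed aligned loads deliver the same 128 bits, it says nothing about timing, and a compiler is free to emit either encoding for either intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (includes the SSE ones) */

/* Hypothetical helper: returns 1 if a float-typed aligned load (movaps-style)
   and an integer-typed aligned load (movdqa-style) of the same address
   produce identical bits. p must be 16-byte aligned. */
int loads_agree(const void *p)
{
    __m128i via_int = _mm_load_si128((const __m128i *)p);
    __m128i via_ps  = _mm_castps_si128(_mm_load_ps((const float *)p));
    /* Byte-wise integer compare; all 16 mask bits set means identical bits. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(via_int, via_ps)) == 0xFFFF;
}
```

Since the loaded bits are the same either way, any speed difference has to come from how the CPU executes the two encodings, which is something only a timing test on the target machine can settle.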

Quote
Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame  :lol
Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story :bdg
hmm..., looks like the beginning of a wonderful love story, i'm just worried... who has planned to meet the tailor for the white dress ?  :eek

No such plans. She has not even sent her photo :tdown

NightWare

Quote from: jj2007 on April 23, 2009, 01:32:30 AM
I saw mentioned that they use different units, which might explain the speed difference. But is there any reason not to use the fastest variant??
the port used by the instruction can speed things up or slow them down; it depends on the other instructions in the algo. now, concerning the fastest variant versus movdqu/movdqa, it makes sense: think of the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2...  :wink