It seems to me that very few people here know and use SSE in their programs. I could be wrong of course.
As for me, I don't know how to use the FPU, SSE (any version), or MMX. Am I missing something? Should I learn how to use them?
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.
Same here.
SSE/MMX only makes sense if you have to calculate or handle more than one 8/16/32-bit value at the same time.
In general I get more efficient results with standard ASM instructions, even when I consider using MMX/SSE instructions.
Once you decide on MMX/SSE, you have to orient your whole program or routine toward this instruction set, and you get more interface overhead.
Quote from: mitchi on April 19, 2009, 06:00:55 PM
Most of my programs work with strings and numbers, I've never had any real use for floating point operations in my programs.
Speed might be an argument, even for strings :wink
Cycles:
11812 InString, 1, addr Mainstr, addr TestSubX
1966 InstrSSE2, 1, addr Mainstr, addr TestSubD, 0
Another example (http://www.masm32.com/board/index.php?topic=11191.msg83115#msg83115)
Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?
it's not very difficult when you are well documented; however, if you want to learn simd usage, you must define your needs first (because there are too many instructions, so you should select the appropriate set for your needs). MMX is essentially for gfx/color manipulation, SSE is essentially for 3D stuff, SSE2 is for both, and SSE3+ is not a big improvement
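To make the "gfx/colors manipulation" case concrete, here is a minimal sketch in C with SSE2 intrinsics rather than MASM. It brightens 16 colour channel bytes with a single saturating add (`paddusb`, which exists for both MMX and XMM registers; the XMM form is used here). The function name `brighten16` is made up for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds `amount` to 16 unsigned 8-bit channel values
   at once. Results that would exceed 255 are clamped to 255 (paddusb). */
static void brighten16(uint8_t px[16], uint8_t amount)
{
    __m128i v = _mm_loadu_si128((const __m128i *)px);   /* unaligned load  */
    __m128i a = _mm_set1_epi8((char)amount);            /* amount in all 16 lanes */
    _mm_storeu_si128((__m128i *)px, _mm_adds_epu8(v, a));
}
```

A scalar version needs a loop with a compare and clamp per byte; here the clamping comes for free with the instruction.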
SSE can also be used to store data in and retrieve it from CPU registers instead of memory. And I use it to handle integers larger than 32 bits.
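As an illustration of the "integers larger than 32 bits" point, the sketch below adds two pairs of 64-bit integers with one SSE2 `paddq`, which is available even in 32-bit code. It is written in C with intrinsics, and `add2_u64` is a name invented for this example, not something from the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for two 64-bit lanes,
   done with a single paddq. Overflow wraps around per lane. */
static void add2_u64(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

Note that there is no carry between the two lanes, so this gives two independent 64-bit adds, not one 128-bit add.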
Quote from: mitchi on April 19, 2009, 11:09:15 PM
Yea, that's a nice speedup! How hard was it for you to learn SSE, compared with the rest?
Not harder than normal assembly; however, there is a confusingly large choice of instructions, many of them doing almost exactly the same. In practice, you can get along with only a few. Check my example from here (http://www.masm32.com/board/index.php?action=dlattach;topic=3601.0;id=6102)- only a handful...
16-byte alignment is a big issue: movaps faults on a memory address that is not 16-byte aligned; movups works on any address, but is a bit slower.
option prologue:none ; no stack frame
option epilogue:none
align 16
InstrJJ proc StartPos:DWORD, lpSource:DWORD, lpPattern:DWORD, sMode:DWORD
push esi
push edi
push ebx ; all registers preserved, except eax = return value
push ebp
push ecx
push edx
mov esi, [esp+6*4+2*4] ; lpSource
mov edx, [esp+6*4+3*4] ; lpPattern
movzx eax, word ptr [edx] ; 3 cycles to fill xmm3 with first word
test ah, ah
je ByteScan
imul eax, 00010001h ; propagate loword
movd xmm3, eax
pshufd xmm3, xmm3, 0 ; xmm3 holds first word of pattern
mov edi, [edx+2] ; next 4 bytes of pattern
mov eax, edi
or ebx, -1
.if al==0
xor ebx, ebx ; byte 3 is zero
mov edi, ebx
.elseif ah==0
movzx ebx, bl ; byte 4 is zero
and edi, ebx
.else
shr eax, 16
.if al==0
movzx ebx, bx ; byte 5 is zero (= and ebx, 0FFFFh)
.elseif ah==0
and ebx, 0ffffffh ; byte 6 is zero
.endif
.endif
and edi, ebx ; apply mask for bytes 2-5
test esi, 15 ; aligned?
je L0 ; if aligned, clear ebp and go directly into the main loop
movups xmm1, [esi] ; load 16 bytes from current unaligned address
movups xmm4, [esi+1] ; load another 16 bytes
mov ebp, esi ; save unaligned address
and esi, -16 ; align esi downwards
jmp @F
L0: xor ebp, ebp
L1: movaps xmm1, [esi] ; load 16 bytes from current aligned address
movups xmm4, [esi+1] ; load another 16 bytes
@@: movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16] ; len counter (moving up/down lea or add costs cycles)
pcmpeqw xmm1, xmm3 ; compare packed words in xmm1 and xmm3 for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pcmpeqw xmm4, xmm3 ; compare packed words in xmm4 and xmm3 for equality
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern word
pmovmskb eax, xmm2 ; set byte mask in eax for zero delimiter byte
pmovmskb ecx, xmm4 ; set byte mask in ecx for search pattern word (at esi+1)
shl ecx, 1 ; adjust for esi+1 (add ecx, ecx is a lot slower)
test eax, eax ; zero byte found?
jnz @F ; check ebp, then ChkNull
or edx, ecx ; one of them needs to have the word
jz L1 ; 0=no pattern byte found, go back
@@: test ebp, ebp ; 0=never unaligned, or second loop
je @F ; ebp=16*n+1....15 ->esi=16*n+16, i.e. esi>ebp
add ebp, 16
.if ebp<esi ; at least second loop
xor ebp, ebp
.endif
and ebp, 15
@@: test eax, eax
jnz ChkNull
@@: bsf ecx, edx ; bit scan for the index --------------------------
lea eax, [esi+ecx-15]
mov eax, [eax+ebp+1] ; first unaligned chunk contains match
btr edx, ecx ; clear bit ecx in edx
and eax, ebx
cmp eax, edi
je FoundPattern
BadLuck:
xor ebp, ebp
test edx, edx
jnz @B ; bit scan end ------------------------------------------
jmp L1 ; 0=no more hits in these 16 bytes, go back searching (reversing order is somewhat slower)
ChkNull:
mov ebx, eax ; position of zero byte
xor eax, eax ; default: 0=no match
or edx, ecx ; one of them needs to have the word
je NoMatch
bsf ebx, ebx ; nullbyte index in ebx
bsf ecx, edx ; pattern word index in ecx
cmp ebx, ecx ; null before pattern word: outta here
jb NoMatch
cmp [esi+ecx-14], edi
jne NoMatch ; first dword after first word doesn't match, so get out
FoundPattern: ; we need to check the complete string here
test edi, edi ; one-word pattern?
je Match
push edi
push esi
mov edi, [esp+6*4+3*4+8] ; lpPattern
lea esi, [esi+ecx-16]
add esi, ebp
@@: inc edi
inc esi
movzx eax, byte ptr [edi]
test eax, eax
je @F
cmp al, byte ptr [esi]
je @B
@@: pop esi
pop edi
test eax, eax
jne BadLuck
Match:
sub esi, [esp+6*4+2*4] ; lpSource: subtract original src pointer
lea eax, [esi+ecx-15] ; and adjust for the index
add eax, ebp ; ebp = offset in first unaligned chunk
NoMatch:
pop edx ; 6 registers
pop ecx
pop ebp
pop ebx
pop edi
pop esi
ret 4*4 ; 4 arguments
ByteScan:
imul eax, 01010101h ; propagate lobyte
movd xmm3, eax
pshufd xmm3, xmm3, 0 ; xmm3 holds first byte of pattern in all 16 lanes
@@: movups xmm1, [esi] ; load 16 bytes from current (possibly unaligned) address
movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16] ; len counter (moving up/down lea or add costs cycles)
pcmpeqb xmm1, xmm3 ; compare packed bytes in xmm1 and xmm3 (elephant) for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern byte
pmovmskb ecx, xmm2 ; set byte mask in ecx for zero delimiter
test ecx, ecx ; zero byte found?
jnz @F ; yes: leave the loop and compare positions
test edx, edx ; pattern found?
jz @B ; 0=no pattern byte found, go back
@@: xor eax, eax
bsf ecx, ecx
bsf edx, edx
cmp ecx, edx
ja NoMatch
lea eax, [esi+edx-15]
sub eax, [esp+6*4+2*4]
jmp NoMatch
InstrJJ endp
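For readers who find the listing hard to follow, here is a rough C sketch of the idea behind its ByteScan branch only, using SSE2 intrinsics: spread the search byte over all 16 lanes, compare 16 bytes at once, turn the result into a bit mask (`pmovmskb`), then find the first set bit (`bsf`). It is not a translation of InstrJJ: `find_byte_sse2` and `lowest_bit` are invented names, the zero test uses a separate zero register instead of the `pcmpeqb xmm2, xmm1` trick, and the result is 0-based with -1 for "not found".

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Index of the lowest set bit, like bsf. mask must be nonzero. */
static int lowest_bit(unsigned mask)
{
    int i = 0;
    while (!(mask & 1u)) { mask >>= 1; ++i; }
    return i;
}

/* Hypothetical helper: 0-based index of byte c in the zero-terminated
   string s, or -1 if c does not occur before the terminator.
   Assumption: it reads whole 16-byte chunks, so the buffer must be
   readable up to the end of the chunk that holds the terminator. */
static int find_byte_sse2(const char *s, unsigned char c)
{
    /* imul eax, 01010101h / movd / pshufd: copy c into all 16 lanes */
    uint32_t spread = (uint32_t)c * 0x01010101u;
    __m128i pattern = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)spread), 0);
    __m128i zero = _mm_setzero_si128();

    for (int base = 0; ; base += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + base)); /* movups */
        /* pcmpeqb + pmovmskb: one mask bit per matching byte */
        unsigned hit = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pattern));
        unsigned nul = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (hit | nul) {
            /* a hit only counts if it comes before the terminator */
            if (hit && (!nul || lowest_bit(hit) < lowest_bit(nul)))
                return base + lowest_bit(hit);
            return -1;
        }
    }
}
```

The word-sized main loop of the listing works the same way, except that it compares 16-bit words at two offsets (`[esi]` and `[esi+1]`) and then verifies the following bytes of the pattern with ordinary integer instructions.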
So who is going to lavish us with a series of SSE tutorials? :bg
It is so complicated with bad programming style, ugly and as results
works with strings ONLY and it is so slowwwww....shame, shame... :'(
just take a look:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
FJT2 Cresta/IanB, byte-length shifts: 4294967295 ; clocks: 248
FJT3 Cresta/IanB, word-length shifts: 4294967295 ; clocks: 202
FJT4 Cresta/IanB, dword-length shifts: 4294967295 ; clocks: 258
Boyer-Moore Lingo, byte-length shifts: 37 ; clocks: 94
Boyer-Moore Lingo, word-length shifts: 37 ; clocks: 120
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 143
InString - JJ: 38 ; clocks: 98
InString - Lingo: 37 ; clocks: 42
Search Test 2 - value expected 1007; lenSrchPattern ->17
FJT2 Cresta/IanB, byte-length shifts: 1007 ; clocks: 9808
FJT3 Cresta/IanB, word-length shifts: 1007 ; clocks: 9857
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 9820
Boyer-Moore Lingo, byte-length shifts: 1007 ; clocks: 7786
Boyer-Moore Lingo, word-length shifts: 1007 ; clocks: 7832
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 7793
InString - JJ: 1008 ; clocks: 22619
InString - Lingo: 1007 ; clocks: 8610
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB, byte-length shifts: 1008 ; clocks: 646
FJT3 Cresta/IanB, word-length shifts: 1008 ; clocks: 597
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 662
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 479
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 497
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 528
InString - JJ: 1009 ; clocks: 715
InString - Lingo: 1008 ; clocks: 513
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB, byte-length shifts: 1008 ; clocks: 2314
FJT3 Cresta/IanB, word-length shifts: 1008 ; clocks: 1461
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2334
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1253
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1279
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1310
InString - JJ: 1009 ; clocks: 6539
InString - Lingo: 1008 ; clocks: 4453
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB, byte-length shifts: 1008 ; clocks: 2477
FJT3 Cresta/IanB, word-length shifts: 1008 ; clocks: 1681
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2493
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1097
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1113
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1145
InString - JJ: 1009 ; clocks: 5428
InString - Lingo: 1008 ; clocks: 4145
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
FJT2 Cresta/IanB, byte-length shifts: 1008 ; clocks: 760
FJT3 Cresta/IanB, word-length shifts: 1008 ; clocks: 714
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 777
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 580
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 606
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 642
InString - JJ: 1009 ; clocks: 628
InString - Lingo: 1008 ; clocks: 513
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
FJT2 Cresta/IanB, byte-length shifts: 1009 ; clocks: 951
FJT3 Cresta/IanB, word-length shifts: 1009 ; clocks: 905
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 968
Boyer-Moore Lingo, byte-length shifts: 1009 ; clocks: 767
Boyer-Moore Lingo, word-length shifts: 1009 ; clocks: 792
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 830
InString - JJ: 1010 ; clocks: 624
InString - Lingo: 1009 ; clocks: 513
Press ENTER to exit...
Call the moderators to help you.... :lol
Quote from: lingo on April 22, 2009, 05:27:12 PM
It is so complicated with bad programming style, ugly
Yeah, that's a known problem. When will you finally learn to comment your code??
Quote
and as results works with strings ONLY and it is so slowwwww....shame, shame... :'(
Well, at least my code works just fine with strings, instead of throwing exceptions like yours (http://www.masm32.com/board/index.php?topic=3601.msg83512#msg83512) if no match is found. Furthermore, it works with any pattern length (yours needs 8 bytes minimum, right?), and for normal, i.e. non exotic cases, it is a factor 7-8 faster than the Masm32 library InString. I am a modest person, a factor 7 faster is enough for me :bg
"instead of throwing exceptions like yours if no match is found."
Due to the numbers of the result do you want to abuse me? :naughty:
Call the moderators for me because I'm not guilty
that you are impotent to use the code properly :lol
Slowwww... shame..shame :lol
Quote from: lingo on April 22, 2009, 07:02:41 PM
"instead of throwing exceptions like yours if no match is found."
Due to the numbers of the result do you want to abuse me? :naughty:
Call the moderators for me because I'm not guilty
that you are impotent to use the code properly :lol
Slowwww... shame..shame :lol
Hey, my angry young friend, RTFM: The title of the thread you are referring to is "String searching"; from the Masm32 library help file: "InString searches for a substring in a larger string". That is what most of the algos in that thread do successfully. Except yours, which crashes on the rather simple task of (not) finding "duplicate inx" at the end of windows.inc ... ::)
On the positive side, I see that nowadays you have cautiously started to comment your code:
BMLinDD proc
...
movd mm5, esp ; save esp register
...
Congratulations, Lingo :U Although I have a suspicion that some of the seasoned old hands here might complain that you state the obvious, it is a step in the right direction! :clap:
Me personally, I would have added
; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg
Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg
hi, the usage of sse/sse2 (single/double precision) is quite limited if you don't do 3D stuff; most of the sse/sse2 hints you can see in the algos posted here are just deviations from the normal use of those instructions.
and if you do 3D stuff, it's essentially math tutorials that are needed, because matrix*matrix, matrix*vector, conditional selection of a vector, transposing a matrix, etc. are the essentials, and the possible instructions are obvious in this case.
Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame :lol
Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg
hmm..., looks like the beginning of a wonderful love story, i'm just worried... who has planned to meet the tailor for the white dress ? :eek
Quote from: NightWare on April 22, 2009, 11:40:45 PM
most of the sse/sse2 hints you can see in the algos posted here are just deviations from the normal use of those instructions.
Correct - this kind of algo is not what SSE2 was originally meant for. But it works :bg
Maybe you can solve a mystery for me; I use movaps+movups for moving integers around in my inner loop:
L1: movaps xmm1, [esi] ; load 16 bytes from current aligned address
movups xmm4, [esi+1] ; load another 16 bytes
@@: movaps xmm2, xmm1 ; save 16 bytes for testing the zero delimiter
lea esi, [esi+16]
...
Tests with the "official" movdqa+movdqu are roughly 2% slower. Intel says (http://software.intel.com/en-us/articles/memcpy-performance/):
"SSE2 movdqu/movdqa instructions were introduced specifically for this purpose. movdqa is suitable for 16-byte aligned operands. movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them.
The Barcelona architecture prefers movaps for stores.
movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding."
I saw it mentioned that they use different units (http://objectmix.com/asm-x86-asm-370/377453-population-count-sse2-again-3.html), which might explain the speed difference. But is there any reason not to use the fastest variant??
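The "functionally equivalent" part of the Intel quote can be checked without any timing. The C sketch below copies the same 16 bytes once through the float-typed load/store pair and once through the integer-typed pair, then compares the bits. The function names are made up for this example, and which instruction a compiler actually emits (`movaps` or `movdqa`) is up to the compiler, so this shows only the equivalence of results, not the speed question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in the SSE ones */
#include <string.h>

/* Copy 16 bytes through the float-typed pair (typically movaps).
   Assumption: src and dst are 16-byte aligned. */
static void copy16_ps(const void *src, void *dst)
{
    _mm_store_ps((float *)dst, _mm_load_ps((const float *)src));
}

/* Copy 16 bytes through the integer-typed pair (typically movdqa).
   Assumption: src and dst are 16-byte aligned. */
static void copy16_si128(const void *src, void *dst)
{
    _mm_store_si128((__m128i *)dst, _mm_load_si128((const __m128i *)src));
}

/* Nonzero if both 16-byte blocks hold identical bits. */
static int same16(const void *a, const void *b)
{
    return memcmp(a, b, 16) == 0;
}
```

Both copies are bit-exact even for values that would be NaNs if read as floats, because plain moves never interpret the data.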
Quote
Quote from: lingo on April 22, 2009, 07:02:41 PM
... Slowwww... shame..shame :lol
Quote from: jj2007 on April 22, 2009, 07:43:56 PM
Me personally, I would have added ; save esp register and trash the FPU, but that's yet another story (http://www.masm32.com/board/index.php?topic=10830.msg79298#msg79298) :bdg
hmm..., looks like the beginning of a wonderful love story, i'm just worried... who has planned to meet the tailor for the white dress ? :eek
No such plans. She has not even sent her photo :tdown
Quote from: jj2007 on April 23, 2009, 01:32:30 AM
I saw mentioned that they use different units, which might explain the speed difference. But is there any reason not to use the fastest variant??
the port used by the instruction can speed things up or slow them down; it depends on the other instructions of the algo. now concerning the fastest movdqu/movdqa, it makes sense: see the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2... :wink
Quote from: NightWare on April 23, 2009, 01:57:36 AM
now concerning the fastest movdqu/movdqa, it makes sense: see the instruction as 2*64 bits compared to 4*32 bits... so the loop is divided by 2... :wink
But movdqa is actually slower than movaps... ::)
"But movdqa is actually slower than movaps."
again nonsense...and two tests more:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 8 - value expected 22646 ;lenSrchPattern ->7260(1C5Ch)
Boyer-Moore Lingo, word-length shifts: 22646 ; clocks: 24483
Boyer-Moore Lingo,dword-length shifts: 22646 ; clocks: 27209
InString - JJ: 22647 ; clocks: 36692
InString - Lingo: 22646 ; clocks: 19656
Search Test 9 -Find 'Duplicate inc' in 'windows.inc' ; lenSrchPattern ->13
Boyer-Moore Lingo, word-length shifts: 1127624 ; clocks: 898528
Boyer-Moore Lingo,dword-length shifts: 1127624 ; clocks: 898721
InString - JJ: 1127625 ; clocks: 680112
InString - Lingo: 1127624 ; clocks: 561030
Press ENTER to exit...
Slowwww again and ...shame, shame... :lol
Quote from: lingo on April 23, 2009, 05:53:52 AM
"But movdqa is actually slower than movaps."
again nonsense...
As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. That is what I did, and on a Celeron M aps+ups are slightly faster. On a Prescott P4, they seem to be roughly equivalent; both statements refer to the inner loop of the InstringJJ posted above.
What is really odd, though, is the performance of register-to-register moves (e.g. movaps xmm1, xmm2) on a P4.
Here are the timings for a Celeron M:
Aligned, mem to xmm:
7 cycles for 4* movaps
7 cycles for 4* movapd
7 cycles for 4* movdqa
Unaligned, mem to xmm:
13 cycles for 4* movups
13 cycles for 4* movupd
13 cycles for 4* movdqu
Aligned, xmm to xmm:
4 cycles for 4* movaps
4 cycles for 4* movapd
4 cycles for 4* movdqa
Aligned, xmm to MEM to xmm:
15 cycles for 4* movaps
15 cycles for 4* movapd
15 cycles for 4* movdqa
And here the P4:
Aligned, mem to xmm:
4 cycles for 4* movaps
3 cycles for 4* movapd
3 cycles for 4* movdqa
Unaligned, mem to xmm:
25 cycles for 4* movups
26 cycles for 4* movupd
25 cycles for 4* movdqu
Aligned, xmm to xmm:
27 cycles for 4* movaps <---------------------------------
27 cycles for 4* movapd
27 cycles for 4* movdqa
Aligned, xmm to MEM to xmm:
17 cycles for 4* movaps <---------------------------------
17 cycles for 4* movapd
17 cycles for 4* movdqa
Surprisingly,
movdqa [esi], xmm0
movdqa xmm1, [esi]
is faster than a simple
movdqa xmm1, xmm0
For the aficionados, I attach the testbed. I could not reproduce that speed gain in a real life algo (the inner loop of the InstringJJ posted above).
[attachment deleted by admin]
"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "
I did it some years ago and my results are similar to this link (http://flashlight.slad.cz/files/memtest_p4_northwood.txt)
Your contrary information is lifted from a generic memcpy() optimization article here (http://74.125.95.132/search?q=cache:LUF0KefzXzEJ:software.intel.com/en-us/articles/memcpy-performance/+movdqa+movaps&cd=10&hl=en&ct=clnk&c)
Hence, "I could not reproduce that speed gain in a real life algo" is proof that
"But movdqa is actually slower than movaps." is nonsense
Hence, nonsense here is just the truth rather than an 'emotional abuse'... :lol
Quote from: Mark Jones on April 22, 2009, 04:52:08 PM
So who is going to lavish us with a series of SSE tutorials? :bg
Please, please team up and give us some tutorials :bg
Imagine all that energy/brain power combined :dazzled:
mark jones wrote. . .
Quote
So who is going to lavish us with a series of SSE tutorials? :bg
yes.. that would be good.
There are some links to reference material & a tute on SSE somewhere on the forum I think
Quote from: lingo on April 23, 2009, 04:39:37 PM
"As an engineer, you should call "nonsense" only what you can measure yourself, under controlled conditions. "
I did it some years ago and my results are similar to this link (http://flashlight.slad.cz/files/memtest_p4_northwood.txt)
Your contrary information is lifted from a generic memcpy() optimization article here (http://74.125.95.132/search?q=cache:LUF0KefzXzEJ:software.intel.com/en-us/articles/memcpy-performance/+movdqa+movaps&cd=10&hl=en&ct=clnk&c)
Hence, "I could not reproduce that speed gain in a real life algo" is proof that
"But movdqa is actually slower than movaps." is nonsense
Hence, nonsense here is just the truth rather than an 'emotional abuse'... :lol
First, you seem to have a serious problem with the concepts of "stealing" and "quoting".
Second, when you post a link, read it at least carefully before crying nonsense:
Quote
Memory copy routines tester by Petr Supina, 2005-2006
Block size: 2 x 16 Bytes
Method Time [ns]
------ ---------
movaps: 6.71704
movdqa: 6.75457
Block size: 2 x 256 Bytes
Method Time [ns]
movaps: 38.7073
movdqa: 38.7663
That pattern changes for larger block sizes, but your link proves that my observation was correct.
Thank you :U
And just for fun, here are the results for my Celeron M, obtained with the Supina software via your link:
Block size: 2 x 16 Bytes
Method Time [ns]
------ ---------
movaps: 11.0079
movdqa: 11.1622
Block size: 2 x 256 Bytes
Method Time [ns]
movaps: 37.8871
movdqa: 39.8011
Block size: 2 x 4096 Bytes
Method Time [us]
------ ---------
movaps: 0.552621
movdqa: 0.556335
Good night Lingo :8)
Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)
no, it CAN be slower, but because of other factors, like a different alignment of the following instructions (because code alignment counts); otherwise the extra byte read is quickly absorbed by the loop once the algo is in the cache. it can also depend on the number of uops for the port, AND the normal number of uops handled by your cpu.
Quote from: NightWare on April 23, 2009, 09:30:52 PM
Quote from: jj2007 on April 23, 2009, 02:10:08 AM
But movdqa is actually slower than movaps... ::)
no, it CAN be slower, but because of other factors, like a different alignment of the following instructions (because code alignment counts); otherwise the extra byte read is quickly absorbed by the loop once the algo is in the cache. it can also depend on the number of uops for the port, AND the normal number of uops handled by your cpu.
Agreed, although code alignment plays a surprisingly small role on modern CPUs. I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4.
Another problem is the consistency of timings. I ran another test with this software (http://flashlight.slad.cz/files/memtest-2006-04-12.zip), see attachment: outliers everywhere, and the supplied 6-digit precision is clearly misleading.
[attachment deleted by admin]
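One common way to tame such outliers, offered here only as a sketch and not as what the memtest tool or the deleted attachment does, is to repeat the measurement several times and report the median instead of a single run or the mean. The C helper below is hypothetical (`median_cycles` is an invented name); it takes cycle counts that were collected elsewhere.

```c
#include <stdlib.h>
#include <stdint.h>

/* qsort comparator for unsigned 64-bit cycle counts */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts the samples in place and returns the
   median (the lower middle element when n is even). n must be > 0. */
static uint64_t median_cycles(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[(n - 1) / 2];
}
```

With samples such as 628, 624, 9000, 630, 626 a single interrupted run (9000) pulls the mean far away from the typical value, while the median stays at 628.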
http://www.masm32.com/board/index.php?topic=8498.0
Needs an update on more sites and report on dead links.
Draakie
"although code alignment plays a surprisingly small role on modern CPU's."
nonsense again...
- If I'm not wrong you have no experience with modern CPUs because your CPUs are still archaic
" I have tested that with a
REPEAT n
nop
ENDM
right before my inner loop, and the effect is nil on my Celeron, and small on my P4."
You are the champion of slow code, hence your code is already at its slowest before you start your test
That is why it can't get any slower after the test... :lol
I know that you can't control your emotions but you can try your test with my code... :lol
Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0
Needs an update on more sites and report on dead links.
Draakie
Thanks mate
Quote from: lingo on April 24, 2009, 01:30:24 PM
I know that you can't control your emotions but you can try your test with my code... :lol
My PC hates exceptions :bg
OK, no offense, but from your code I see that you are still a mad code pilferer and newbie spaghetti code creator without any ideas and experience in programming
Due to your age and lack of interest in learning new things from others' experience (A. Fog, etc.) you will stay a mad newbie (with level and interests like this (http://www.masm32.com/board/index.php?topic=7653.0) or this (http://www.masm32.com/board/index.php?topic=10780.msg78979#msg78979) or this (http://www.masm32.com/board/index.php?topic=9383.0) or this (http://www.masm32.com/board/index.php?topic=9782.0) etc.) until the end of your life. That is why I lose interest and don't want to waste my time on people like you.. :tdown Sorry mad leaky watering-pot and don't forget to get your medicine now! :lol
Quote from: Draakie on April 24, 2009, 11:26:14 AM
http://www.masm32.com/board/index.php?topic=8498.0
Needs an update on more sites and report on dead links.
Draakie
Thanks, Draakie. The neilkemp and dennishome links are more or less dead, while http://www.tommesani.com/Docs.html is still one of the better sources. Jorgon (http://www.jorgon.freeserve.co.uk/TestbugHelp/XMMintins.htm) has a good intro, too.
An excellent complete reference is here (http://www.ews.uiuc.edu/~cjiang/reference/index.htm).
What I really miss is an in-depth discussion of what exactly are the rules for using/mixing the float and integer instructions. Some sources say that movaps, movapd and movdqa are "functionally equivalent"... so why do we need them all?
.. an alt to the dead link
http://www.neilkemp.us/v4/articles/sse_tutorial/sse_tutorial.html
and another..
http://softpixel.com/~cwright/programming/simd/mmx.php
Quote from: Rainstorm on April 25, 2009, 08:20:01 AM
.. an alt to the dead link
http://www.neilkemp.us/v4/articles/sse_tutorial/sse_tutorial.html
and another..
http://softpixel.com/~cwright/programming/simd/mmx.php
Thank you very much !
http://www.masm32.com/board/index.php?topic=782.0
It might be a dead link too, and it only ran on 32-bit Windows.