I'm interested in optimizing long memory operations, such as searching through gigabytes of memory, and I believe SIMD is the way to go. Am I right? :bg
Searching the forum topics is hard because "sse" is contained in "assembly" ...
Is there a good assembly+SSE (SSE2 at least) tutorial, or should I plunge into Intel's documentation? Any useful links would be appreciated.
EDIT: Oh, I found this thread after searching with better keywords:
http://www.masm32.com/board/index.php?topic=8498.0
From that link, I have been reading the Tommesani docs...
http://www.tommesani.com/Docs.html
search the forum for pcmpeqb
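For anyone wondering what a pcmpeqb-based search looks like, here is a minimal sketch in C with SSE2 intrinsics, where each intrinsic maps to one instruction (noted in the comments). The function name find_byte_sse2 and its "return len when not found" convention are made up for illustration, not taken from any thread on the board.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first byte equal to
   needle in buf[0..len), or len if there is no match. */
static size_t find_byte_sse2(const uint8_t *buf, size_t len, uint8_t needle)
{
    const __m128i pattern = _mm_set1_epi8((char)needle); /* needle in all 16 lanes */
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i)); /* movdqu   */
        __m128i eq    = _mm_cmpeq_epi8(chunk, pattern);              /* pcmpeqb  */
        int mask      = _mm_movemask_epi8(eq);                       /* pmovmskb */
        if (mask != 0) {
            /* The lowest set bit marks the first matching byte of the chunk. */
            size_t bit = 0;
            while (((mask >> bit) & 1) == 0)
                bit++;
            return i + bit;
        }
    }
    for (; i < len; i++)  /* scalar tail for the last 0..15 bytes */
        if (buf[i] == needle)
            return i;
    return len;
}
```

The loop compares 16 bytes per iteration, so the instruction count drops a lot, but on gigabyte-sized buffers the memory bus may still end up being the bottleneck.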
Quote from: jj2007 on March 01, 2010, 08:40:54 PM
search the forum for pcmpeqb
I found some more interesting threads, thanks.
Can you do me a little favour, please? I want to time-test some code. Could you write a quick, bare-bones, frameless procedure that fills memory with 0BBBBBBBBh (or some other dword), with esi = starting offset and the number of bytes given by the MEMCHUNK define?
As a control for the test, my function is this:
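@ZeroMemPlain:              ; fills MEMCHUNK bytes starting at esi
mov eax,esi                 ; eax = write pointer
mov ecx, MEMCHUNK/4         ; ecx = number of dwords to write
.again:
mov D[eax],0bbbbbbbbh       ; store one dword
add eax,4                   ; advance to the next dword
dec ecx
jnz < .again                ; loop until every dword is written
ret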
:bg
It takes about 140 ms to fill 256 MB on my PC.
I just want to be convinced that the speedup is worth it; I'm not interested in doing complex arithmetic with SSE.
EDIT: I fixed some stuff.
.data
align 16
values db 16 dup (0BBh) ; 0BBh in every byte, so each dword matches the control's 0BBBBBBBBh
.code
mov eax,esi
movdqa xmm0,OWORD ptr values
mov ecx,MEMCHUNK/16
@@: movdqa OWORD ptr [eax],xmm0 ; use movdqu if ESI is unaligned (not recommended)
lea eax,[eax+16]
dec ecx
jnz @B
@@:
Thanks qWord. It works fine, but the timings are exactly the same as before. Exactly. I guess it depends on memory throughput and not execution cycles.
At least it's fun to step over that instruction and see 128 bits of data moved at once :green
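To make the comparison concrete, here is a rough C sketch of the two fills side by side, using SSE2 intrinsics. The names fill_dword and fill_sse2 are invented for this example, and it assumes a 16-byte aligned buffer whose size is a multiple of 16. Note that an optimizing compiler may vectorize the scalar loop on its own, so for timing purposes the hand-written assembly above is the more honest control.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar control: one 32-bit store per iteration, like the plain mov loop. */
static void fill_dword(uint32_t *dst, size_t bytes, uint32_t value)
{
    for (size_t i = 0; i < bytes / 4; i++)
        dst[i] = value;
}

/* SSE2 version: one aligned 128-bit store per iteration (movdqa).
   Assumes dst is 16-byte aligned and bytes is a multiple of 16. */
static void fill_sse2(uint32_t *dst, size_t bytes, uint32_t value)
{
    const __m128i v = _mm_set1_epi32((int)value); /* value in all four lanes */
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_store_si128(p + i, v);
}
```

Both write the same bytes; the SSE2 loop just runs a quarter as many iterations. At 256 MB in about 140 ms the plain loop is already moving a bit under 2 GB/s, so if that is close to what the memory subsystem can sustain, fewer instructions will not make the fill any faster.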
BlackVortex,
Memory speed is the final limitation. The real advantage of SSE is its capacity to process 128 bits of data in parallel. A reduced memory access count, plus parallel processing of the variety of data types it can handle, will speed up many algorithms by a long way, but in raw data transfer, memory will be the final limitation.
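As a small illustration of the parallel processing point, the sketch below adds two dword arrays four elements at a time with paddd, written in C with SSE2 intrinsics. The function name add_arrays_sse2 is hypothetical, and unaligned loads are used so that it works on any buffer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a[i] += b[i] for i in [0, n), four dwords per step. */
static void add_arrays_sse2(int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* movdqu */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i)); /* movdqu */
        _mm_storeu_si128((__m128i *)(a + i),
                         _mm_add_epi32(va, vb));                /* paddd  */
    }
    for (; i < n; i++)  /* scalar tail for the last 0..3 elements */
        a[i] += b[i];
}
```

Here one instruction does the work of four scalar adds. The more work is done per byte loaded, the more this pays off, which is why a pure fill or copy gains so little.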
Am I right in thinking that
mov = 4 mem access = 2
mov 128 = 4 mem access = 8
movdqa = 1 mem access = 8
Kind of thing....
I have an SSE2 copy that is faster, but I think it uses separate registers, so the speedup in my app might be due to that?
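Without seeing that routine this is only a guess at its shape, but a copy that "uses separate registers" often looks like the unrolled sketch below: four independent loads into four registers, then four stores, 64 bytes per iteration. The name copy_sse2_unrolled is invented, and the code assumes both pointers are 16-byte aligned and the size is a multiple of 64. A C compiler is also free to assign registers differently from what the source suggests.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical unrolled copy: 64 bytes per iteration through four values.
   Assumes dst and src are 16-byte aligned and bytes is a multiple of 64. */
static void copy_sse2_unrolled(void *dst, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / 64; i++) {
        /* Four independent loads (movdqa), none waiting on another. */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        /* Four stores (movdqa) of the values just loaded. */
        _mm_store_si128(d + 0, r0);
        _mm_store_si128(d + 1, r1);
        _mm_store_si128(d + 2, r2);
        _mm_store_si128(d + 3, r3);
        s += 4;
        d += 4;
    }
}
```

If a routine like this beats a one-register loop, the likely reasons are less loop overhead per byte and more loads in flight at once. Timing both versions on the same buffer is the only way to know which effect matters on a given machine.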