How to get started with simd/sse ?

Started by BlackVortex, March 01, 2010, 08:04:09 PM

BlackVortex

I'm interested in optimizing long memory operations, like searching through gigabytes of memory, and I believe SIMD is the way to go. Am I right?  :bg

Searching the forum topics is hard because "sse" is a substring of "assembly" ...

Is there a good assembly+SSE (SSE2 at least) tutorial, or should I plunge into Intel's documentation? Any useful links would be appreciated.

EDIT: Oh, I found this thread after searching with better keywords:
http://www.masm32.com/board/index.php?topic=8498.0

dedndave


jj2007


BlackVortex

Quote from: jj2007 on March 01, 2010, 08:40:54 PM
search the forum for pcmpeqb
I found some more interesting threads, thanks.
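
For anyone landing here from a search, this is a rough sketch of the kind of pcmpeqb byte scan those threads describe. It is not taken from any of them. The procedure name FindByteSSE2 and the register convention (esi = 16-byte aligned start, ecx = length as a non-zero multiple of 16, al = byte to find) are assumptions made up for this illustration, and the routine clobbers ecx, edx and esi.

```asm
; Sketch only: find the first occurrence of a byte with SSE2.
; Hypothetical convention (not from the thread):
;   esi = start address, 16-byte aligned
;   ecx = length in bytes, non-zero multiple of 16
;   al  = byte to search for
; Returns eax = address of the first match, or 0 if there is none.
    .686
    .xmm
    .model flat, stdcall
    option casemap:none
    .code
FindByteSSE2 proc
    movzx    eax, al
    imul     eax, eax, 01010101h    ; copy the byte into all 4 bytes of eax
    movd     xmm1, eax
    pshufd   xmm1, xmm1, 0          ; broadcast that dword to all 16 bytes
scan_loop:
    movdqa   xmm0, OWORD PTR [esi]  ; load 16 bytes
    pcmpeqb  xmm0, xmm1             ; 0FFh where a byte matches, 0 elsewhere
    pmovmskb edx, xmm0              ; collect one bit per byte into edx
    test     edx, edx
    jnz      found
    add      esi, 16
    sub      ecx, 16
    jnz      scan_loop
    xor      eax, eax               ; no match
    ret
found:
    bsf      edx, edx               ; index of the first matching byte
    lea      eax, [esi+edx]
    ret
FindByteSSE2 endp
    end
```

Each iteration compares 16 bytes at once, and the branch is only taken on a hit. Unaligned starts and lengths that are not a multiple of 16 would need extra head and tail handling that is left out here.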

Can you do me a little favour, please? I want to time-test some code. Can you make a quick, barebones, no-frame procedure that fills memory with 0BBBBBBBBh (or some other dword), where esi = starting offset and the number of bytes is given by the MEMCHUNK define?

As a test control, my function is this :

@ZeroMemPlain:
    mov eax,esi               ; eax = current write address
    mov ecx, MEMCHUNK/4       ; number of dwords to write
.again:
    mov D[eax],0bbbbbbbbh     ; store one dword
    add eax,4
    dec ecx
    jnz < .again
    ret

:bg
It takes about 140 ms to fill 256 MB on my PC.
I just want to be convinced that the speedup is worth it. I'm not interested in doing complex arithmetic with SSE.

EDIT: I fixed some stuff.

qWord

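    .data
        align 16
        values db 16 dup (0BBh)          ; 16 bytes of 0BBh = four dwords of 0BBBBBBBBh
    .code
    mov eax,esi                          ; eax = current write address
    movdqa xmm0,OWORD ptr values         ; load the 128-bit fill pattern once
    mov ecx,MEMCHUNK/16                  ; number of 16-byte stores (MEMCHUNK must be a multiple of 16)
@@: movdqa OWORD ptr [eax],xmm0  ; use movdqu if ESI is unaligned  (not recommended)
    lea eax,[eax+16]
    dec ecx
    jnz @B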

BlackVortex

Thanks qWord. It works fine, but the timings are exactly the same as before. Exactly. I guess it depends on memory throughput and not execution cycles.

At least it's fun to step over that instruction and see 128 bits of data moved at once  :green

hutch--

BlackVortex,

Memory speed is the final limitation. The real advantage of SSE is its capacity to process 128 bits of data in parallel. A reduced memory access count and parallel processing of the variety of data types it can handle will get the speed of many algorithms up by a long way, but in raw data transfer memory will be the final limitation.
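
A back-of-the-envelope check of the figures quoted earlier in the thread, assuming "256mb" means 256 MiB (2^28 bytes) and taking the 140 ms timing at face value:

```latex
\begin{align*}
\text{throughput} &\approx \frac{256\ \text{MiB}}{0.140\ \text{s}}
  \approx 1829\ \text{MiB/s} \approx 1.79\ \text{GiB/s} \\
\text{dword stores} &= \frac{2^{28}}{4} = 2^{26} = 67{,}108{,}864 \\
\text{128-bit stores} &= \frac{2^{28}}{16} = 2^{24} = 16{,}777{,}216
\end{align*}
```

The SSE2 loop issues a quarter as many stores, yet both versions took the same time. That fits the idea that the fill is bound by how fast memory accepts the data rather than by instruction count.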

oex

Am I right in thinking that

mov = 4 mem access = 2
mov 128 = 4 mem access = 8
movdqa = 1 mem access = 8

Kind of thing....

I have an SSE2 copy that is faster, but I think it uses separate registers, so the speedup in my app might be due to that?
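
To illustrate what "separate registers" might mean, here is a rough sketch of a copy loop unrolled over four xmm registers. This is not oex's routine, which was not posted. The name CopyBlockSSE2 and the convention (esi = source, edi = destination, both 16-byte aligned, ecx = byte count as a non-zero multiple of 64) are assumptions for this example only.

```asm
; Sketch only: 64 bytes per iteration through four xmm registers.
; Hypothetical convention (not from the thread):
;   esi = source, edi = destination, both 16-byte aligned
;   ecx = byte count, non-zero multiple of 64
; Clobbers ecx, esi, edi and xmm0-xmm3.
    .686
    .xmm
    .model flat, stdcall
    option casemap:none
    .code
CopyBlockSSE2 proc
copy_loop:
    movdqa  xmm0, OWORD PTR [esi]       ; four independent loads
    movdqa  xmm1, OWORD PTR [esi+16]
    movdqa  xmm2, OWORD PTR [esi+32]
    movdqa  xmm3, OWORD PTR [esi+48]
    movdqa  OWORD PTR [edi], xmm0       ; four independent stores
    movdqa  OWORD PTR [edi+16], xmm1
    movdqa  OWORD PTR [edi+32], xmm2
    movdqa  OWORD PTR [edi+48], xmm3
    add     esi, 64
    add     edi, 64
    sub     ecx, 64
    jnz     copy_loop
    ret
CopyBlockSSE2 endp
    end
```

Using different registers means no load has to wait for an earlier store to release its register, and the unrolling cuts the loop overhead to a quarter. Whether either effect shows up for large blocks would need timing, since for big copies memory is likely to be the limit anyway.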