ZeroMemory with SSE2

woonsan · January 11, 2008, 07:55:11 PM

I wrote a routine that zero-fills memory using SSE instructions.

The code is below:
fmaszero proc pDest:DWORD, pLength:DWORD
    xorps xmm0, xmm0 ; xmm0 = 0
    mov edi, dword ptr [pDest] ; EDI = destination address
    mov ecx, dword ptr [pLength] ; ECX = length in bytes
    mov edx, ecx ; Save the length in EDX
    shr ecx, 4d ; ECX = number of 16-byte blocks
L_1:
    movdqu oword ptr [edi], xmm0 ; Store 16 zero bytes (unaligned store)
    add edi, 16d ; Advance the pointer
    dec ecx ; One block fewer to go
    jnz L_1 ; Loop while blocks remain
    mov ecx, edx ; Reload the length from EDX
    and ecx, 15d ; ECX = length mod 16 (leftover bytes)
    xor al, al ; AL = 0
    rep stosb ; Zero the leftover bytes
    ret ; Return
fmaszero endp

I would like to hear other people's opinions on this method.
In my opinion, it is the fastest approach.
What do you think?

swsnyder · January 11, 2008, 08:31:49 PM

1. Those unaligned writes (movdqu) are a performance killer. Align the pointer before starting the SSE loop.

2. The benefits of SSE vs. memset() vary by buffer length. Is this really faster for the buffers you'll use it on?

3. The break-even length for SSE vs stosd varies by processor.

4. Have you tried temporal vs non-temporal writes in your environment?
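
Point 1 can be sketched as follows. This is a minimal illustration in C with SSE2 intrinsics rather than MASM, and it is not code from the thread: the name zero_aligned_sse2 is hypothetical. It zeroes single bytes until the pointer reaches a 16-byte boundary, then uses aligned stores (the movdqa form), then finishes the leftover bytes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: zero `len` bytes at `dst`,
   aligning the pointer to 16 bytes before the SSE2 loop starts. */
void zero_aligned_sse2(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;

    /* Bytes needed to reach the next 16-byte boundary (0 if already aligned). */
    size_t head = (16u - ((uintptr_t)p & 15u)) & 15u;
    if (head > len)
        head = len;
    len -= head;
    for (; head != 0; --head)
        *p++ = 0;

    /* Aligned 16-byte stores; the loop is skipped entirely when len < 16. */
    const __m128i zero = _mm_setzero_si128();
    for (size_t blocks = len >> 4; blocks != 0; --blocks) {
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
    }

    /* Remaining 0..15 bytes. */
    for (size_t tail = len & 15u; tail != 0; --tail)
        *p++ = 0;
}
```

Whether the extra head loop pays off depends on the buffer length and the processor, as points 2 and 3 say, so it is worth timing on the buffers actually in use.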

GregL · January 11, 2008, 08:45:40 PM

Stephanos,

If the memory block is 16 bytes or more it works fine.
If the memory block is less than 16 bytes, it overwrites other memory.
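
The reason is the loop shape: for a length below 16, the shift leaves a block count of 0, the store runs once anyway, and the decrement then wraps to 0FFFFFFFFh, so the loop keeps going. A minimal sketch of a guarded version, written in C with SSE2 intrinsics rather than MASM (the name zero_unaligned_sse2 is hypothetical and this is not the poster's code), tests the count before the first store:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper, not from the thread: same structure as the posted
   routine (unaligned 16-byte stores, then a byte tail), but the block
   count is checked before the first store. */
void zero_unaligned_sse2(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    const __m128i zero = _mm_setzero_si128();

    size_t blocks = len >> 4;          /* whole 16-byte blocks; 0 when len < 16 */
    while (blocks != 0) {              /* test first, so 0 blocks means 0 stores */
        _mm_storeu_si128((__m128i *)p, zero);
        p += 16;
        --blocks;
    }

    size_t tail = len & 15u;           /* leftover 0..15 bytes */
    while (tail != 0) {
        *p++ = 0;
        --tail;
    }
}
```

In the assembly version the equivalent change would be to test ECX after the shift and jump past the loop when it is zero.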

woonsan · January 11, 2008, 09:58:31 PM

Sorry, I wrote this code at school and didn't test it, so it has some problems.
If so, what is the fastest method?

GregL · January 11, 2008, 10:02:45 PM

Stephanos,

Don't be sorry, it's a good idea. :U I was just pointing out a problem I saw.

I'll leave the "what is the fastest method" for someone else. It's debatable and varies with the processor.

woonsan · January 11, 2008, 10:22:15 PM

I ran a speed test with this board's 'ZeroMemory Speed Test' kit, and my routine turned out to be very slow.
Microsoft's routine shows about 380, but mine shows about 700...
MOVDQU really is a speed killer, so I tried MOVDQA.
With MOVDQA it shows about 290 (the fastest).

Hmm, MOVDQU is really not good. From now on I will use MOVDQA.

hutch-- · January 11, 2008, 10:26:46 PM

Stephanos,

Aligned is almost always faster, so it's worth the effort to align the data so you can use the faster instruction. If you can organise it, use a non-temporal write as it is not slowed down by the cache.

Vaguely I remember that a 64 bit MMX fill is still faster than a 128 bit SSE version so it may be worth having a look at that as well.

woonsan · January 11, 2008, 10:36:52 PM

Hmm. I tried using MMX instructions to fill memory, as shown below.

fmaszero proc pDest:DWORD, pLength:DWORD
    pxor mm0, mm0 ; mm0 = 0 (emms alone does not clear mm0)
    ; xorps xmm0, xmm0 ; (SSE version)
    mov edi, dword ptr [pDest] ; EDI = destination address
    mov ecx, dword ptr [pLength] ; ECX = length in bytes
    mov edx, ecx ; Save the length in EDX
    shr ecx, 3d ; ECX = number of 8-byte blocks
L_1:
    ; movdqa oword ptr [edi], xmm0 ; (SSE version)
    movq qword ptr [edi], mm0 ; Store 8 zero bytes
    ; add edi, 16d ; (SSE version)
    add edi, 8d ; Advance the pointer
    dec ecx ; One block fewer to go
    jnz L_1 ; Loop while blocks remain
    mov ecx, edx ; Reload the length from EDX
    ; and ecx, 15d ; (SSE version)
    and ecx, 7d ; ECX = length mod 8 (leftover bytes)
    xor al, al ; AL = 0
    rep stosb ; Zero the leftover bytes
    emms ; Clear the MMX state before returning
    ret ; Return
fmaszero endp

It took about 380, so I think it is slower than the SSE2 version.

How can I write the MMX version so that it runs as fast as possible?

swsnyder · January 11, 2008, 11:01:53 PM

Quote from: hutch-- on January 11, 2008, 10:26:46 PM

Vaguely I remember that a 64 bit MMX fill is still faster than a 128 bit SSE version so it may be worth having a look at that as well.

Maybe on some CPUs, but it's about the same on my P3 (330MHz/256KB) and definitely slower on my P4 (2.4GHz/512KB).

NightWare · January 11, 2008, 11:26:45 PM

i've posted a sse fast zeromem algo, here
http://www.masm32.com/board/index.php?topic=7458.0

and i'm quite sure a mmx algo can't beat it :toothy

hutch-- · January 11, 2008, 11:29:24 PM

Have a look at the instruction "movntq" for 64 bit fast fills. The action is in the NT part of the instruction, non temporal means it does not write back through the cache.
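
For illustration, here is a minimal sketch of a non-temporal fill in C with SSE2 intrinsics. Note the swap: it uses _mm_stream_si128 (the 128-bit movntdq store) instead of the 64-bit MMX movntq, and the function name zero_stream_sse2 is hypothetical, not something from this thread. It assumes the destination is 16-byte aligned and the length is a multiple of 16, and rejects anything else.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: zero `len` bytes at `dst`
   with non-temporal 16-byte stores.
   Returns 0 on success, -1 if dst is not 16-byte aligned or len is not
   a multiple of 16 (nothing is written in that case). */
int zero_stream_sse2(void *dst, size_t len)
{
    if (((uintptr_t)dst & 15u) != 0 || (len & 15u) != 0)
        return -1;

    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();

    for (size_t blocks = len >> 4; blocks != 0; --blocks)
        _mm_stream_si128(p++, zero);   /* non-temporal store */

    _mm_sfence();   /* order the streaming stores before later stores */
    return 0;
}
```

Non-temporal stores tend to help only when the buffer is large and will not be read again soon, so on a small buffer that is already in cache they can easily come out slower than ordinary stores. Measure before relying on them.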

NightWare · January 11, 2008, 11:41:21 PM

:P ok, i'll take a look at this instruction, and i'll make a speed test (even if i'm quite sure of the result). i'll report the result...

NightWare · January 12, 2008, 12:09:05 AM

??? i was quite sure of the result, but i didn't expect this one !!!


Results of the speed tests between the different macros:

Routine RtlZeroMemory, completed in 181 cycles
eax = 0   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4284248

Routine 1, completed in 282 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 2, completed in 136 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 3, completed in 77 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 4, completed in 255 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 5, completed in 142 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 6, completed in 80 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 7, completed in 556 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 8, completed in 3 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Press ENTER to quit...

for info :
routine 0 = rtlZeroMemory (the one from ntdll.dll)
routine 1 = my ALU zeromem
routine 2 = my MMX zeromem
routine 3 = my SSE zeromem
routine 4 = my unaligned ALU zeromem
routine 5 = my unaligned MMX zeromem
routine 6 = my unaligned SSE zeromem
routine 7 = unaligned MMX movntq zeromem
routine 8 = empty

for this test i used exactly the same instructions for routine 5 and routine 7, except that routine 7 uses movntq instead of movq...

woonsan · January 12, 2008, 12:46:00 AM

NightWare, your algorithm's speed is pretty good! Hmm... So in the end, what matters most is the algorithm itself...

Jimg · January 12, 2008, 04:18:34 PM

NightWare-
How about including rep stosd as a baseline?
