ZeroMemory with SSE2

Started by woonsan, January 11, 2008, 07:55:11 PM

woonsan

I wrote a routine that fills memory with zeros using SSE instructions.

The code is below:
fmaszero proc pDest:DWORD, pLength:DWORD
  xorps xmm0, xmm0 ; Zero XMM0 (the 16-byte fill value)
  mov edi, dword ptr [pDest] ; EDI = address of the buffer
  mov ecx, dword ptr [pLength] ; ECX = length of the buffer in bytes
  mov edx, ecx ; Save the length in EDX
  shr ecx, 4d ; Divide by 16 (128 bits per store)
  L_1:
   movdqu oword ptr [edi], xmm0 ; Store 16 zero bytes at [EDI] (unaligned store)
   add edi, 16d ; Advance the pointer
   dec ecx ; One block fewer to go
  jnz L_1 ; Loop until ECX reaches zero
  mov ecx, edx ; Reload the length from EDX
  and ecx, 15d ; Remainder of the length divided by 16
  xor al, al ; AL = 0
  rep stosb ; Zero the remaining bytes
  ret ; Return
fmaszero endp

I'd like other people's opinions on this method.
In my opinion, it is the fastest approach.
What do you think?

swsnyder

1. Those unaligned writes (movdqu) are a performance killer.  Align the pointer before starting the SSE loop.

2. The benefits of SSE vs. memset() vary by buffer length. Is this really faster for the buffers you'll use it on?

3. The break-even length for SSE vs stosd varies by processor.

4. Have you tried temporal vs non-temporal writes in your environment?
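
Point 1 can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the name fmaszero_aligned and its labels are hypothetical, and it assumes MASM with .686/.xmm enabled, a flat stdcall model, and the direction flag clear on entry. It zeroes single bytes until EDI reaches a 16-byte boundary, then switches to aligned movdqa stores.

```asm
; Sketch only: fmaszero_aligned is a hypothetical name.
fmaszero_aligned proc uses edi pDest:DWORD, nLength:DWORD
    mov   edi, pDest            ; EDI = destination
    mov   edx, nLength          ; EDX = bytes still to clear
    xor   eax, eax              ; AL = 0 for the byte fills
    mov   ecx, edi
    neg   ecx
    and   ecx, 15               ; ECX = bytes up to the next 16-byte boundary
    cmp   ecx, edx
    jbe   head_ok
    mov   ecx, edx              ; buffer ends before the boundary
  head_ok:
    sub   edx, ecx              ; EDX = bytes left after the head
    rep   stosb                 ; clear the head; EDI is now aligned (or done)
    mov   ecx, edx
    shr   ecx, 4                ; whole 16-byte blocks
    jz    do_tail               ; none: skip the SSE loop
    pxor  xmm0, xmm0            ; 16 zero bytes
  zero_blocks:
    movdqa oword ptr [edi], xmm0 ; aligned store
    add   edi, 16
    dec   ecx
    jnz   zero_blocks
  do_tail:
    mov   ecx, edx
    and   ecx, 15               ; 0..15 leftover bytes
    rep   stosb                 ; no-op when ECX = 0
    ret
fmaszero_aligned endp
```

Whether the extra head handling pays off depends on the buffer length and the processor, as points 2 and 3 note, so it is worth timing on the target machine.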


GregL

Stephanos,


  • If the memory block is 16 bytes or more it works fine.
  • If the memory block is less than 16 bytes, it overwrites other memory.
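
The overrun happens because, for a length below 16, the shift leaves ECX at 0, and the first `dec ecx` wraps it to 0FFFFFFFFh, so the loop keeps storing far past the buffer. One possible fix, shown here as a sketch that keeps the original structure, is to skip the loop when there are no whole blocks. The name fmaszero_guarded and the labels are hypothetical; the `uses edi` clause is added on the assumption that callers expect EDI to be preserved, as the stdcall convention requires.

```asm
; Sketch only: fmaszero_guarded is a hypothetical name.
fmaszero_guarded proc uses edi pDest:DWORD, nLength:DWORD
    mov   edi, pDest            ; EDI = destination
    mov   ecx, nLength          ; ECX = byte count
    mov   edx, ecx              ; keep the full count for the tail
    shr   ecx, 4                ; whole 16-byte blocks
    jz    do_tail               ; fewer than 16 bytes: no SSE stores at all
    pxor  xmm0, xmm0            ; 16 zero bytes
  zero_blocks:
    movdqu oword ptr [edi], xmm0
    add   edi, 16
    dec   ecx
    jnz   zero_blocks
  do_tail:
    mov   ecx, edx
    and   ecx, 15               ; 0..15 leftover bytes
    xor   eax, eax
    rep   stosb                 ; no-op when ECX = 0
    ret
fmaszero_guarded endp
```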

woonsan

Sorry, I wrote this code at school and didn't test it, so it has some problems.
In that case, what is the fastest method?

GregL

Stephanos,

Don't be sorry, it's a good idea.  :U  I was just pointing out a problem I saw.

I'll leave the "what is the fastest method" for someone else. It's debatable and varies with the processor.


woonsan

I ran a speed test with this board's 'ZeroMemory Speed Test' kit, and my routine turned out to be very slow.
Microsoft's routine scores about 380, but mine scores about 700.
MOVDQU really is a speed killer, so I tried MOVDQA.
With MOVDQA it scores about 290 (the fastest).

So MOVDQU is not good. From now on I will use MOVDQA.

hutch--

Stephanos,

Aligned is almost always faster, so it's worth the effort to align the data so you can use the faster instruction. If you can organise it, use a non-temporal write, as it is not slowed down by the cache.

Vaguely I remember that a 64 bit MMX fill is still faster than a 128 bit SSE version, so it may be worth having a look at that as well.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

woonsan

Hmm. I tried filling memory with MMX instructions, as below.

fmaszero proc pDest:DWORD, pLength:DWORD
  pxor mm0, mm0 ; Zero MM0 (EMMS alone does not clear it)
;  xorps xmm0, xmm0 ; Zero XMM0
  mov edi, dword ptr [pDest] ; EDI = address of the buffer
  mov ecx, dword ptr [pLength] ; ECX = length of the buffer in bytes
  mov edx, ecx ; Save the length in EDX
  shr ecx, 3d ; Divide by 8 (64 bits per store)
  jz L_2 ; Fewer than 8 bytes: skip the MMX loop
  L_1:
;   movdqa oword ptr [edi], xmm0 ; Store 16 zero bytes at [EDI]
   movq qword ptr [edi], mm0 ; Store 8 zero bytes at [EDI]
;   add edi, 16d ; Advance the pointer
   add edi, 8d ; Advance the pointer
   dec ecx ; One block fewer to go
  jnz L_1 ; Loop until ECX reaches zero
  L_2:
  mov ecx, edx ; Reload the length from EDX
;  and ecx, 15d ; Remainder of the length divided by 16
  and ecx, 7d ; Remainder of the length divided by 8
  xor al, al ; AL = 0
  rep stosb ; Zero the remaining bytes
  emms ; Leave MMX state before returning
  ret ; Return
fmaszero endp

It took about 380, so I think it is slower than the SSE2 version.

How can I write the MMX version so that it runs as fast as possible?

swsnyder

Quote from: hutch-- on January 11, 2008, 10:26:46 PM

Vaguely I remember that a 64 bit MMX fill is still faster than a 128 bit SSE version, so it may be worth having a look at that as well.

Maybe on some CPUs, but it's about the same on my P3 (330MHz/256KB) and definitely slower on my P4 (2.4GHz/512KB).

NightWare

i've posted a sse fast zeromem algo, here
http://www.masm32.com/board/index.php?topic=7458.0

and i'm quite sure a mmx algo can't beat it  :toothy

hutch--

Have a look at the instruction "movntq" for 64 bit fast fills. The action is in the NT part of the instruction: non-temporal means the write does not go back through the cache.
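
A minimal sketch of such a fill, for illustration only: the name zero_movntq and its labels are hypothetical, and it assumes MASM with .686/.xmm enabled, a flat stdcall model, and the direction flag clear on entry. movntq does not require an aligned address, but an 8-byte aligned destination is assumed here for speed. The sfence after the loop orders the streaming stores before later writes, and emms releases the MMX state so x87 code can run afterwards.

```asm
; Sketch only: zero_movntq is a hypothetical name.
zero_movntq proc uses edi pDest:DWORD, nLength:DWORD
    mov   edi, pDest            ; EDI = destination (ideally 8-byte aligned)
    mov   ecx, nLength          ; ECX = byte count
    mov   edx, ecx              ; keep the full count for the tail
    shr   ecx, 3                ; whole 8-byte blocks
    jz    do_tail               ; fewer than 8 bytes: skip the MMX loop
    pxor  mm0, mm0              ; MM0 = 0
  zero_blocks:
    movntq qword ptr [edi], mm0 ; non-temporal 8-byte store
    add   edi, 8
    dec   ecx
    jnz   zero_blocks
    sfence                      ; order the streaming stores
    emms                        ; leave MMX state
  do_tail:
    mov   ecx, edx
    and   ecx, 7                ; 0..7 leftover bytes
    xor   eax, eax
    rep   stosb                 ; no-op when ECX = 0
    ret
zero_movntq endp
```

Non-temporal stores tend to help most on large buffers that will not be read again soon; on a small buffer that is already in the cache they can be slower than ordinary stores, so this needs measuring.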

NightWare

 :P ok, i'll take a look at this instruction, and i'll make a speed test (even if i'm quite sure of the result). i'll report the result...

NightWare

??? i was quite sure of the result, but i didn't expect this result !!!
Speed test results for the different macros:

Routine RtlZeroMemory, completed in 181 cycles
eax = 0   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4284248

Routine 1, completed in 282 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 2, completed in 136 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 3, completed in 77 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 4, completed in 255 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 5, completed in 142 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 6, completed in 80 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 7, completed in 556 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Routine 8, completed in 3 cycles
eax = 1024   ebx = 197016916/   ecx = 1991541844
edx = 1997672244   esi = 4278576   edi = 4276224

Press ENTER to quit...


for info :
routine 0 = RtlZeroMemory (the one from ntdll.dll)
routine 1 = my ALU zeromem
routine 2 = my MMX zeromem
routine 3 = my SSE zeromem
routine 4 = my unaligned ALU zeromem
routine 5 = my unaligned MMX zeromem
routine 6 = my unaligned SSE zeromem
routine 7 = unaligned MMX movntq zeromem
routine 8 = empty

for this test i use exactly the same instructions for routine 5 and routine 7, except i use movntq instead of movq...

woonsan

NightWare, your algorithm is pretty fast! Hmm, so in the end it is the algorithm itself that matters most...

Jimg

NightWare-
How about including  rep stosd  as a baseline?
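
For reference, a plain rep stosd fill of the kind suggested here might look like the sketch below. The name zero_stosd is hypothetical, and the code assumes MASM with a flat stdcall model and the direction flag clear on entry. It clears whole dwords first and then the last 0 to 3 bytes.

```asm
; Sketch only: zero_stosd is a hypothetical name.
zero_stosd proc uses edi pDest:DWORD, nLength:DWORD
    mov   edi, pDest            ; EDI = destination
    mov   ecx, nLength          ; ECX = byte count
    mov   edx, ecx              ; keep the full count for the tail
    xor   eax, eax              ; EAX = 0 (fill value)
    shr   ecx, 2                ; whole dwords
    rep   stosd                 ; clear 4 bytes at a time
    mov   ecx, edx
    and   ecx, 3                ; 0..3 leftover bytes
    rep   stosb                 ; no-op when ECX = 0
    ret
zero_stosd endp
```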