The fastest way to clear a buffer

Started by frktons, August 24, 2010, 08:47:34 PM

dedndave

no prob JJ   :P
it feels good to catch you, once in a while

frktons

After some experimentation I got these results:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
2059    cycles for RtlZeroMemory
4023    cycles for FrkTons
2070    cycles for rep stosd
1062    cycles for movdqa
1062    cycles for movaps
1024    cycles for FrkTons New
5023    cycles for movups
5050    cycles for movupd

2087    cycles for RtlZeroMemory
4043    cycles for FrkTons
2064    cycles for rep stosd
1038    cycles for movdqa
1050    cycles for movaps
1016    cycles for FrkTons New
5036    cycles for movups
5042    cycles for movupd


--- ok ---


How is this possible?
The new test is attached.

Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2293    cycles for RtlZeroMemory
4029    cycles for FrkTons
2272    cycles for rep stosd
2017    cycles for movdqa
2018    cycles for movaps
2140    cycles for FrkTons New
6026    cycles for movups
6021    cycles for movupd


Can't see any surprises in here ::)

frktons

Quote from: jj2007 on September 03, 2010, 08:44:08 PM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2293    cycles for RtlZeroMemory
4029    cycles for FrkTons
2272    cycles for rep stosd
2017    cycles for movdqa
2018    cycles for movaps
2140    cycles for FrkTons New
6026    cycles for movups
6021    cycles for movupd


Can't see any surprises in here ::)

I should have imagined that I did something wrong  :P

Maybe if you use an older CPU the program wouldn't even run  :lol

Antariy

Hi, Frank!

These are the results on my CPU:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
5144    cycles for RtlZeroMemory
8253    cycles for FrkTons
4952    cycles for rep stosd
4790    cycles for movdqa
4801    cycles for movaps
4862    cycles for FrkTons New
10594   cycles for movups
10601   cycles for movupd

4960    cycles for RtlZeroMemory
8383    cycles for FrkTons
4952    cycles for rep stosd
4820    cycles for movdqa
4795    cycles for movaps
4861    cycles for FrkTons New
10598   cycles for movups
10602   cycles for movupd




Alex

jj2007

Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run  :lol

You would need a very old CPU.

On mine, you can save 40 cycles with a small modification:
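mov edx, offset Dest       ; start of buffer; must be 16-byte aligned for movdqa
lea ecx, [edx+16000]       ; end of buffer: 16000 bytes = 200 passes of 80 bytes
mov eax, 20202020h         ; four ASCII spaces
movd xmm0, eax
pshufd xmm0, xmm0, 0       ; broadcast the dword to all four lanes of xmm0
;                  movdqa xmm1, xmm0    ; copies not needed:
;                  movdqa xmm2, xmm0    ; every store below uses xmm0
;                  movdqa xmm3, xmm0
;                  movdqa xmm4, xmm0

@@:
movdqa [edx], xmm0
movdqa [edx + 16], xmm0
movdqa [edx + 32], xmm0
movdqa [edx + 48], xmm0
movdqa [edx + 64], xmm0
add edx, 80
cmp edx, ecx
jb @B                      ; unsigned compare, since edx and ecx hold addresses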

Antariy

And you can save some bytes if you use MOVAPS for the moves to registers and to memory :)



Alex
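
As a rough illustration of the byte savings Alex mentions: MOVDQA carries a 66h prefix that MOVAPS does not, so each MOVAPS form is one byte shorter. The listing below shows the encodings for the register and memory forms used in jj2007's loop. It says nothing about speed, only about code size.

```asm
movdqa xmm1, xmm0          ; 66 0F 6F C8       4 bytes
movaps xmm1, xmm0          ; 0F 28 C8          3 bytes

movdqa [edx], xmm0         ; 66 0F 7F 02       4 bytes
movaps [edx], xmm0         ; 0F 29 02          3 bytes

movdqa [edx + 16], xmm0    ; 66 0F 7F 42 10    5 bytes
movaps [edx + 16], xmm0    ; 0F 29 42 10       4 bytes
```

For the five stores in the loop this comes to 24 bytes with MOVDQA against 19 bytes with MOVAPS.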

frktons

Quote from: jj2007 on September 03, 2010, 10:50:43 PM
Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run  :lol

You would need a very old CPU.

On mine, you can save 40 cycles with a small modification:
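mov edx, offset Dest       ; start of buffer; must be 16-byte aligned for movdqa
lea ecx, [edx+16000]       ; end of buffer: 16000 bytes = 200 passes of 80 bytes
mov eax, 20202020h         ; four ASCII spaces
movd xmm0, eax
pshufd xmm0, xmm0, 0       ; broadcast the dword to all four lanes of xmm0
;                  movdqa xmm1, xmm0    ; copies not needed:
;                  movdqa xmm2, xmm0    ; every store below uses xmm0
;                  movdqa xmm3, xmm0
;                  movdqa xmm4, xmm0

@@:
movdqa [edx], xmm0
movdqa [edx + 16], xmm0
movdqa [edx + 32], xmm0
movdqa [edx + 48], xmm0
movdqa [edx + 64], xmm0
add edx, 80
cmp edx, ecx
jb @B                      ; unsigned compare, since edx and ecx hold addresses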


I already tested this kind of unrolling, but on the Core 2 Duo the best performance
comes with 5 different XMM registers. The CPU architecture plays the biggest
role in the 20-50 cycle difference, which is not that much anyway. In my opinion it's
just the cache memory that gives some extra speed on the Core 2. I'd like to see
what these routines gain or lose on the more recent quad-core/i3-i7 machines as well,
if anyone has one of these newer CPUs.

Frank
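
A minimal sketch of the five-register variant Frank describes, assuming it is simply jj2007's loop with the commented-out copies restored and each store using its own register. The routine in the attached test is not shown in the thread, so this is an illustration and not Frank's actual code. Dest is assumed to be a 16-byte-aligned buffer of 16000 bytes, as in the loop above.

```asm
; Assumption: Dest is a 16-byte-aligned, 16000-byte buffer declared elsewhere.
    mov     edx, offset Dest
    lea     ecx, [edx+16000]        ; end of buffer
    mov     eax, 20202020h          ; four ASCII spaces
    movd    xmm0, eax
    pshufd  xmm0, xmm0, 0           ; broadcast to all 16 bytes
    movdqa  xmm1, xmm0              ; four more copies of the fill pattern
    movdqa  xmm2, xmm0
    movdqa  xmm3, xmm0
    movdqa  xmm4, xmm0
@@:
    movdqa  [edx], xmm0             ; each store reads a different register
    movdqa  [edx + 16], xmm1
    movdqa  [edx + 32], xmm2
    movdqa  [edx + 48], xmm3
    movdqa  [edx + 64], xmm4
    add     edx, 80
    cmp     edx, ecx
    jb      @B                      ; unsigned compare on addresses
```

Whether this beats the single-register loop depends on the CPU, as the timings in this thread show.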

frktons

Quote from: Antariy on September 03, 2010, 10:54:27 PM
And you can save some bytes, if use MOVAPS for moving to regs and to memory :)

Alex


Yes Alex. I'm testing just the speed and MOVDQA looks a little bit faster than MOVAPS.
It's a very tiny difference indeed. At least on my machine.

Frank

Antariy

Frank, on i3-i7 "rep stosd" should be at least as fast as the SSE code. I cannot check this.

What if you write to memory with "MOVNTDQ"? It is a non-temporal write to memory, bypassing the cache.

Edited:
Timings with direct (non-temporal) writes:

13705   cycles for FrkTons New with MOVNTDQ
13819   cycles for FrkTons New with MOVNTPD
13787   cycles for FrkTons New with MOVNTPS



Other timings omitted, because I have already posted them.

Alex

frktons

Quote from: Antariy on September 03, 2010, 11:07:59 PM
Frank, on i3-i7 "rep stosd" should be at least as fast as the SSE code. I cannot check this.

What if you write to memory with "MOVNTDQ"? It is a non-temporal write to memory, bypassing the cache.

Alex


Alex, MOVNTDQ is quite slow on my machine:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
2059    cycles for RtlZeroMemory
4029    cycles for FrkTons
2047    cycles for rep stosd
1033    cycles for movdqa
1032    cycles for movaps
1017    cycles for FrkTons New
5024    cycles for movups
5020    cycles for movupd

8047    cycles for MOVNTDQ

2090    cycles for RtlZeroMemory
4046    cycles for FrkTons
2062    cycles for rep stosd
1052    cycles for movdqa
1047    cycles for movaps
1017    cycles for FrkTons New
5036    cycles for movups
5042    cycles for movupd

7864    cycles for MOVNTDQ


--- ok ---


I suppose that if you could do
rep stosq with a 64-bit rxx register, the results would be similar to or better
than the SSE2 instructions. But nobody has taken on the task of building the test with a 64-bit
assembler. I'll probably do it when I'm more familiar with JWasm.
For the time being I can't find the time to study that as well  :P

Frank
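
A minimal sketch of the 64-bit idea, untested and untimed. It assumes a 64-bit assembler (ml64, or JWasm in its 64-bit mode) and the Microsoft x64 calling convention; FillSpaces64 is a hypothetical name, and the byte count is assumed to be a multiple of 8.

```asm
; Hypothetical routine: fill a buffer with ASCII spaces, 8 bytes per store.
;   rcx = destination address
;   rdx = byte count (assumed to be a multiple of 8)
.code
FillSpaces64 proc
    mov     r8, rdi                   ; rdi is callee-saved on Win64, so keep a copy
    mov     rdi, rcx                  ; rep stosq writes through rdi
    mov     rcx, rdx
    shr     rcx, 3                    ; byte count -> qword count
    mov     rax, 2020202020202020h    ; eight ASCII spaces
    rep     stosq                     ; direction flag is clear per the calling convention
    mov     rdi, r8                   ; restore the caller's rdi
    ret
FillSpaces64 endp
end
```

Saving rdi in a volatile register keeps the sketch short; a production routine would use a proper prologue so that unwinding works if the store faults.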

Antariy

Yes, I posted my results for non-temporal writes as well (in the post where I asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.


Alex
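
A sketch of what such a non-temporal fill could look like on a large buffer. BUFSIZE and pBuf are hypothetical names: pBuf is assumed to hold a 16-byte-aligned address (MOVNTDQ faults on unaligned memory operands) and BUFSIZE is assumed to be a multiple of 64.

```asm
BUFSIZE equ 32*1024*1024            ; assumed size: 32 MB, far larger than the cache

    mov     edx, pBuf               ; pBuf: hypothetical dword holding the aligned address
    lea     ecx, [edx+BUFSIZE]      ; end of buffer
    mov     eax, 20202020h          ; four ASCII spaces
    movd    xmm0, eax
    pshufd  xmm0, xmm0, 0           ; broadcast to all 16 bytes
@@:
    movntdq [edx], xmm0             ; non-temporal stores: 64 bytes per pass
    movntdq [edx + 16], xmm0
    movntdq [edx + 32], xmm0
    movntdq [edx + 48], xmm0
    add     edx, 64
    cmp     edx, ecx
    jb      @B                      ; unsigned compare on addresses
    sfence                          ; make the weakly ordered stores visible before continuing
```

The SFENCE belongs inside the timed region if the test is meant to measure completed writes.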

frktons

Quote from: Antariy on September 03, 2010, 11:20:37 PM
Yes, I posted my results for non-temporal writes as well (in the post where I asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex


I'm trying with 16MB, but the program is taking a lot of time to compile  ::)
Will it ever end?

Antariy

Quote from: frktons on September 03, 2010, 11:34:53 PM
Quote from: Antariy on September 03, 2010, 11:20:37 PM
Yes, I posted my results for non-temporal writes as well (in the post where I asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex


I'm trying with 16MB, but the program is taking a lot of time to compile  ::)
Will it ever end?


This is a known problem with in-exe allocation of data, maybe a "bug" in MASM. Try JWasm.
Or try allocating the buffer with the heap functions (GlobalAlloc etc.). Allocate a 32 or 16MB buffer on the heap and use it in the tests.



Alex
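
A minimal sketch of the heap approach, assuming the MASM32 SDK is installed in its default location so that masm32rt.inc supplies the processor directives, Windows constants and kernel32 prototypes. BUFSIZE, hMem and pBuf are hypothetical names. GlobalAlloc does not promise 16-byte alignment, so the sketch over-allocates by 15 bytes and rounds the pointer up for the aligned SSE stores.

```asm
include \masm32\include\masm32rt.inc    ; assumption: default MASM32 SDK path

BUFSIZE equ 32*1024*1024                ; 32 MB test buffer

.data?
hMem    dd ?                            ; pointer returned by GlobalAlloc (needed for GlobalFree)
pBuf    dd ?                            ; 16-byte-aligned start of the usable buffer

.code
start:
    invoke  GlobalAlloc, GMEM_FIXED, BUFSIZE+15
    test    eax, eax
    jz      done                        ; allocation failed
    mov     hMem, eax
    add     eax, 15
    and     eax, -16                    ; round up to the next 16-byte boundary
    mov     pBuf, eax

    ; ... run the timed fill routines on pBuf here ...

    invoke  GlobalFree, hMem
done:
    invoke  ExitProcess, 0
end start
```

Because the buffer no longer lives in the executable's data section, the assembler has nothing large to emit and build time should not depend on BUFSIZE.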

frktons

Quote from: Antariy on September 03, 2010, 11:46:07 PM

This is a known problem with in-exe allocation of data, maybe a "bug" in MASM. Try JWasm.
Or try allocating the buffer with the heap functions (GlobalAlloc etc.). Allocate a 32 or 16MB buffer on the heap and use it in the tests.

Alex


These are all new things for me. I'm a little tired at the moment; I'll try tomorrow after studying something about GlobalAlloc etc.,
and after resting for a while. It's night here.

Frank