The fastest way to clear a buffer

Started by frktons, August 24, 2010, 08:47:34 PM

Antariy

Quote from: frktons on September 03, 2010, 11:52:45 PM

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

For a 16 MB buffer:


invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024


EAX returns the pointer to the buffer. Save it in a variable; in the testing code, load the pointer from that variable into a register and use it from there.
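
A minimal sketch of that sequence, assuming a default MASM32 SDK install. BUF_SIZE and DataPtr are hypothetical names chosen for illustration, and the single store only stands in for the real code under test:

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK in its default location

BUF_SIZE equ 16*1024*1024               ; 16 MB (hypothetical constant name)

.data?
DataPtr dd ?                            ; hypothetical variable holding the buffer pointer

.code
start:
    invoke GlobalAlloc, GMEM_ZEROINIT or GMEM_FIXED, BUF_SIZE
    test eax, eax
    jz   finished                       ; NULL means the allocation failed
    mov  DataPtr, eax                   ; save the returned pointer in the variable

    ; --- code under test would go here ---
    mov  edi, DataPtr                   ; load the pointer from the variable into a register
    mov  dword ptr [edi], 0             ; ...and address the buffer through that register

    invoke GlobalFree, DataPtr          ; release the buffer when done
finished:
    invoke ExitProcess, 0
end start
```

With GMEM_FIXED the value returned in EAX is the buffer address itself, so it can be used directly as a pointer.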



Alex

frktons

Quote from: Antariy on September 03, 2010, 11:57:37 PM

Alex, is this enough, or do I have to change something else?

.data?
align 16
; Dest db 16000000 dup(?) ; <------ don't use it anymore
DataPtr  dd ? ; <-------------- Pointer for data allocated

.code
start:
    push 1
    call ShowCpu                  ; print brand string and SSE level

    invoke GlobalAlloc, GMEM_ZEROINIT or GMEM_FIXED, 16*1024*1024
    mov DataPtr, eax              ; save the returned pointer

    REPEAT 2
        invoke Sleep, 100
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
            invoke RtlZeroMemory, DataPtr, 16000000   ; <--- is this use of DataPtr correct?
        counter_end

Mind is like a parachute. You know what to do in order to use it :-)

frktons

Alex, you were right: with a big buffer I get these results:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11628116        cycles for RtlZeroMemory
15883968        cycles for FrkTons
10996290        cycles for rep stosd
15473203        cycles for movdqa
15480211        cycles for movaps
15471071        cycles for FrkTons New
15477872        cycles for movups
15445525        cycles for movupd

8082000 cycles for MOVNTDQ

10999930        cycles for RtlZeroMemory
15870714        cycles for FrkTons
11012185        cycles for rep stosd
15418317        cycles for movdqa
15427633        cycles for movaps
15416041        cycles for FrkTons New
15418995        cycles for movups
15415995        cycles for movupd

8162844 cycles for MOVNTDQ


--- ok ---


and rep stosd is faster than the SSE2 moves; only the non-temporal MOVNTDQ beats it.
I reduced the loop count for the test from 1 million to 1,000
to make it shorter.
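
For reference, a rep stosd clear of an allocated buffer can be sketched as below. This is not the code from the test program; ClearStosd and BUF_SIZE are hypothetical names, a default MASM32 SDK install is assumed, and the byte count is assumed to be a multiple of 4.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK in its default location

BUF_SIZE equ 16*1024*1024               ; 16 MB (hypothetical constant name)

.code
; Hypothetical helper: zero-fills cbBytes bytes (a multiple of 4) starting at pBuf.
ClearStosd proc uses edi pBuf:DWORD, cbBytes:DWORD
    mov  edi, pBuf                      ; destination pointer
    mov  ecx, cbBytes
    shr  ecx, 2                         ; byte count -> dword count
    xor  eax, eax                       ; fill value
    cld                                 ; store towards higher addresses
    rep  stosd
    ret
ClearStosd endp

start:
    invoke GlobalAlloc, GMEM_FIXED, BUF_SIZE
    test eax, eax
    jz   finished                       ; NULL means the allocation failed
    mov  ebx, eax                       ; EBX survives the API calls below
    invoke ClearStosd, ebx, BUF_SIZE
    invoke GlobalFree, ebx
finished:
    invoke ExitProcess, 0
end start
```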

Frank

clive

Absent a newer build, here are the results from the last one:

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6354    cycles for RtlZeroMemory
10341   cycles for FrkTons
6335    cycles for rep stosd
4143    cycles for movdqa
4108    cycles for movaps
1667    cycles for FrkTons New
8220    cycles for movups
8340    cycles for movupd

6296    cycles for RtlZeroMemory
10333   cycles for FrkTons
6265    cycles for rep stosd
4117    cycles for movdqa
4153    cycles for movaps
1675    cycles for FrkTons New
8227    cycles for movups
8232    cycles for movupd


Core Solo

Genuine Intel(R) CPU           T1350  @ 1.86GHz (SSE3)
2319    cycles for RtlZeroMemory
4072    cycles for FrkTons
2305    cycles for rep stosd
2039    cycles for movdqa
2039    cycles for movaps
2155    cycles for FrkTons New
6098    cycles for movups
6088    cycles for movupd

2315    cycles for RtlZeroMemory
4082    cycles for FrkTons
2296    cycles for rep stosd
2038    cycles for movdqa
2038    cycles for movaps
2164    cycles for FrkTons New
6087    cycles for movups
6095    cycles for movupd
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on September 04, 2010, 12:38:21 AM

Wooops, the Atom really likes working with many XMM registers at a time.
You are right, clive, I didn't post the new build, so here it is.  :U

frktons

And for readability, here is a version that formats the elapsed
CPU cycle counts with a thousands separator.
This version tests filling a 16 MB buffer.
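
One way such formatting could be done is sketched below. This is an illustration, not the routine used in the attachment. FmtThousands and outbuf are hypothetical names, a default MASM32 SDK install is assumed, and the sketch only handles 32-bit counts (anything above 4.294.967.295 would need 64-bit division).

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK in its default location

.data?
outbuf db 16 dup(?)                     ; 10 digits + 3 separators + terminating zero fit

.code
; Hypothetical helper: writes dwValue as zero-terminated decimal text to pOut,
; with a '.' between each group of three digits.
FmtThousands proc uses ebx esi edi dwValue:DWORD, pOut:DWORD
    LOCAL tmp[16]:BYTE
    lea  edi, tmp[15]                   ; build the text backwards from the end of tmp
    mov  byte ptr [edi], 0              ; terminating zero
    mov  eax, dwValue
    mov  ebx, 10
    xor  esi, esi                       ; digits written since the last separator
nextdigit:
    xor  edx, edx
    div  ebx                            ; EAX = quotient, EDX = lowest digit
    add  dl, '0'
    dec  edi
    mov  [edi], dl
    inc  esi
    test eax, eax
    jz   copyout                        ; no digits left: never emit a leading '.'
    cmp  esi, 3
    jne  nextdigit
    dec  edi
    mov  byte ptr [edi], '.'            ; separator before the next group
    xor  esi, esi
    jmp  nextdigit
copyout:
    mov  esi, edi                       ; start of the finished text
    mov  edi, pOut
copychar:
    mov  al, [esi]
    mov  [edi], al
    inc  esi
    inc  edi
    test al, al
    jnz  copychar                       ; stop after copying the terminating zero
    ret
FmtThousands endp

start:
    invoke FmtThousands, 11967493, addr outbuf
    invoke StdOut, addr outbuf          ; StdOut comes from masm32.lib
    invoke ExitProcess, 0
end start
```

Stopping as soon as the quotient reaches zero is what keeps values with exactly three or six digits from getting a leading separator.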

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11.967.493      cycles for RtlZeroMemory
15.682.875      cycles for FrkTons
10.971.464      cycles for rep stosd
15.418.911      cycles for movdqa
15.435.221      cycles for movaps
15.409.998      cycles for FrkTons New
15.405.469      cycles for movups
15.518.687      cycles for movupd
8.056.812       cycles for MOVNTDQ

11.051.772      cycles for RtlZeroMemory
15.535.943      cycles for FrkTons
10.997.179      cycles for rep stosd
15.467.940      cycles for movdqa
15.457.092      cycles for movaps
15.485.439      cycles for FrkTons New
15.514.719      cycles for movups
15.513.319      cycles for movupd
8.053.411       cycles for MOVNTDQ


--- ok ---


Attached is the "improved version".  :P

In my humble n00b-ish opinion, when we use REP STOSQ on a native
64-bit OS on x64 machines,
it is going to beat everything else. Not sure about MOVNTDQ, though.
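
For anyone who wants to try that, a REP STOSQ clear could look roughly like the sketch below. It assumes the 64-bit Microsoft assembler (ml64) and the Windows x64 calling convention. ClearStosq is a hypothetical name, the byte count is assumed to be a multiple of 8, and the routine is meant to be called from a separate test harness.

```asm
; Assumes ml64 and the Windows x64 calling convention:
; RCX = pointer to the buffer, RDX = byte count (a multiple of 8).
.code
ClearStosq proc frame                   ; hypothetical helper name
    push rdi                            ; RDI is non-volatile, so it must be preserved
    .pushreg rdi
    .endprolog
    mov  rdi, rcx                       ; destination pointer
    mov  rcx, rdx
    shr  rcx, 3                         ; byte count -> qword count
    xor  eax, eax                       ; writing EAX also clears the upper half of RAX
    rep  stosq
    pop  rdi
    ret
ClearStosq endp
end
```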

Frank

Rockoon

AMD Phenom(tm) II X6 1055T Processor (SSE3)
10.598.549      cycles for RtlZeroMemory
11.260.417      cycles for FrkTons
10.182.613      cycles for rep stosd
10.131.907      cycles for movdqa
10.115.035      cycles for movaps
10.214.832      cycles for FrkTons New
9.915.582       cycles for movups
10.188.273      cycles for movupd
6.888.405       cycles for MOVNTDQ

10.199.509      cycles for RtlZeroMemory
11.050.508      cycles for FrkTons
10.192.022      cycles for rep stosd
10.131.227      cycles for movdqa
10.113.104      cycles for movaps
10.217.227      cycles for FrkTons New
9.952.748       cycles for movups
10.184.520      cycles for movupd
6.700.808       cycles for MOVNTDQ
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15.891.933      cycles for RtlZeroMemory
23.350.696      cycles for FrkTons
15.800.872      cycles for rep stosd
23.825.786      cycles for movdqa
23.886.914      cycles for movaps
23.872.424      cycles for FrkTons New
23.760.942      cycles for movups
23.730.159      cycles for movupd
9.818.186       cycles for MOVNTDQ

frktons

Quote from: jj2007 on September 04, 2010, 04:51:36 PM

Well, for big buffers, as Alex said, MOVNTDQ is faster than anything else
on every machine tested so far.
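
For readers who have not seen it, a MOVNTDQ clear loop can be sketched like this. It is an illustration, not the code from the test program. ClearNT and BUF_SIZE are hypothetical names, and the sketch assumes a default MASM32 SDK install plus an assembler version that accepts SSE2 mnemonics. The non-temporal stores write around the cache, which is a likely reason they do well on buffers much larger than the cache.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK in its default location
.686
.xmm                                    ; enable the SSE/SSE2 mnemonics

BUF_SIZE equ 16*1024*1024               ; 16 MB (hypothetical constant name)

.code
; Hypothetical helper: zero-fills cbBytes bytes at pBuf with non-temporal stores.
; Assumes pBuf is 16-byte aligned and cbBytes is a non-zero multiple of 64.
ClearNT proc pBuf:DWORD, cbBytes:DWORD
    mov  edx, pBuf
    mov  ecx, cbBytes
    shr  ecx, 6                         ; number of 64-byte blocks
    pxor xmm0, xmm0                     ; 16 bytes of zeros
@@:
    movntdq [edx],    xmm0
    movntdq [edx+16], xmm0
    movntdq [edx+32], xmm0
    movntdq [edx+48], xmm0
    add  edx, 64
    dec  ecx
    jnz  @B
    sfence                              ; order the streamed stores before later stores
    ret
ClearNT endp

start:
    invoke GlobalAlloc, GMEM_FIXED, BUF_SIZE + 16
    test eax, eax
    jz   finished                       ; NULL means the allocation failed
    mov  ebx, eax                       ; keep the original pointer for GlobalFree
    lea  esi, [eax+15]
    and  esi, -16                       ; round up to a 16-byte boundary for MOVNTDQ
    invoke ClearNT, esi, BUF_SIZE
    invoke GlobalFree, ebx
finished:
    invoke ExitProcess, 0
end start
```

The extra 16 bytes and the rounding are there because MOVNTDQ faults on an address that is not 16-byte aligned, and the sketch does not rely on the allocator returning such an address.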

Clive's Atom was really impressive when using multiple
XMM registers at a time. I hope he'll post his results for this test as well.

clive

OK, this is from the original Acer Aspire One; I should try it on the newer one with the N450 CPU.

From ClearBufferNew3

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23855499        cycles for RtlZeroMemory
23614698        cycles for FrkTons
23448786        cycles for rep stosd
23485010        cycles for movdqa
23496468        cycles for movaps
23578524        cycles for FrkTons New
23528033        cycles for movups
23437548        cycles for movupd

9166934 cycles for MOVNTDQ

23551485        cycles for RtlZeroMemory
23568079        cycles for FrkTons
23531873        cycles for rep stosd
23494767        cycles for movdqa
23480850        cycles for movaps
23560303        cycles for FrkTons New
23521236        cycles for movups
23485775        cycles for movupd

9156989 cycles for MOVNTDQ


From ClearBufferNew4

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23.621.570      cycles for RtlZeroMemory
23.548.342      cycles for FrkTons
23.633.944      cycles for rep stosd
23.482.208      cycles for movdqa
23.584.646      cycles for movaps
23.504.980      cycles for FrkTons New
23.561.681      cycles for movups
23.543.870      cycles for movupd
9.184.061       cycles for MOVNTDQ

23.566.672      cycles for RtlZeroMemory
23.507.384      cycles for FrkTons
23.601.428      cycles for rep stosd
23.489.780      cycles for movdqa
23.512.724      cycles for movaps
23.516.591      cycles for FrkTons New
23.549.450      cycles for movups
23.596.780      cycles for movupd
9.156.032       cycles for MOVNTDQ

frktons

The Atom is a surprise again  :dazzled:
It always gives striking results, for better or for worse.

Thanks clive

Frank

Antariy

This is an attempt to build an x64 app on a machine that runs x64 code very badly :P
Please post your results; Frank has been waiting for this for a very long time.


Frank, this is compiled with GoAsm.



Alex

frktons

Hi Alex, thanks for doing this test.
The results on my machine are:


---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)
---------------------------
OK   
---------------------------
---------------------------
REP STOSQ
---------------------------
Clocks: 1.977.888.757

(Press [Ctrl]+[C] for copying to clipboard)
---------------------------
OK   
---------------------------



What buffer size did you use, and how many loops did you run?
Maybe 1 million loops and 16 MB?

Well, these results seem to confirm my suspicion that on a 64-bit machine
REP STOSQ is the fastest buffer-filling instruction for the time being  :U

May I know how you compiled it?
I can download GoAsm and try some experiments.

Frank

Antariy

Quote from: frktons on September 05, 2010, 09:47:46 PM
Hi Alex, thanks for doing this test.
The results on my machine are:


---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)
---------------------------
OK   
---------------------------


What buffer size did you use, and how many loops did you run?
Maybe 1 million loops and 16 MB?

Frank


Frank, you forgot to paste the timings for REP STOSQ; they are in the second message box.

I used a 32 MB buffer and 100 test loops.



Alex
P.S. What are the timings for STOSQ?

frktons

I posted them in the previous post, Alex. Have a look.

REP STOSQ looks about twice as fast as MOVNTPD,
at least on my machine.

Thanks again for doing the test.  :clap:

Frank