The fastest way to clear a buffer

Antariy · September 03, 2010, 11:57:37 PM

Quote from: frktons on September 03, 2010, 11:52:45 PM

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

For a 16 MB buffer:

invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

The pointer to the buffer is returned in eax. Save it in a variable; in the test code, load the pointer from that variable into a register and use the register.
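
For readers who want to try the same idea outside MASM, here is a minimal C sketch: allocate a zero-filled 16 MB buffer once, keep the pointer in a variable, and reload it in the test code. It is a stand-in and does not call GlobalAlloc; the names `data_ptr` and `alloc_zeroed_aligned` are made up for illustration. It also asks for 16-byte alignment explicitly, since movdqa, movaps and MOVNTDQ fault on an address that is not 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (16u * 1024u * 1024u)   /* 16 MB, same size as in the GlobalAlloc call */

/* Hypothetical variable that holds the buffer pointer between
   the allocation and the test code. */
static unsigned char *data_ptr;

/* Allocate a zero-filled, 16-byte-aligned buffer (C11 aligned_alloc).
   Returns NULL on failure. */
static unsigned char *alloc_zeroed_aligned(size_t size)
{
    assert(size % 16 == 0);              /* size must be a multiple of the alignment */
    unsigned char *p = aligned_alloc(16, size);
    if (p != NULL)
        memset(p, 0, size);              /* plays the role of GMEM_ZEROINIT */
    return p;
}
```

In the test code you would do `data_ptr = alloc_zeroed_aligned(BUF_SIZE);` once, then copy `data_ptr` into a local before the timed loop, and `free(data_ptr)` at the end.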

Alex

frktons · September 04, 2010, 12:05:27 AM

Quote from: Antariy on September 03, 2010, 11:57:37 PM
Quote from: frktons on September 03, 2010, 11:52:45 PM

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

For a 16 MB buffer:

invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

The pointer to the buffer is returned in eax. Save it in a variable; in the test code, load the pointer from that variable into a register and use the register.

Alex

Alex, is this enough, or do I have to change something else?

.data?
align 16
; Dest	db 16000000 dup(?) ; <------ don't use it anymore
DataPtr  dd ? ; <-------------- Pointer for data allocated

.code
start:
     push 1
     call ShowCpu				; print brand string and SSE level

      invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

      mov DataPtr, eax  
      
	REPEAT 2
		invoke Sleep, 100
		counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
			invoke RtlZeroMemory, DataPtr, 16000000 ; <----------------- is this use of DataPtr correct?
		counter_end
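
The loop above times one routine between counter_begin and counter_end. A rough C sketch of the same idea follows, assuming GCC or Clang on x86. It is not the counter_begin/counter_end macro pair: it does no serialisation, does not change the process priority and does not subtract loop overhead. The names `clear_fn`, `clear_memset` and `average_cycles` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only */

/* Signature shared by every clearing routine under test (hypothetical). */
typedef void (*clear_fn)(void *dst, size_t len);

/* Baseline routine to time: plain memset. */
static void clear_memset(void *dst, size_t len)
{
    memset(dst, 0, len);
}

/* Call fn(dst, len) `loops` times and return the average number of
   time-stamp-counter ticks per call. */
static uint64_t average_cycles(clear_fn fn, void *dst, size_t len, unsigned loops)
{
    assert(loops > 0);
    uint64_t start = __rdtsc();
    for (unsigned i = 0; i < loops; i++)
        fn(dst, len);
    uint64_t stop = __rdtsc();
    return (stop - start) / loops;
}
```

Something like `average_cycles(clear_memset, buf, 16000000, 1000)` would then give a number comparable in spirit to the figures posted in this thread, though an optimising compiler may inline or merge the calls, so treat the result with care.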

frktons · September 04, 2010, 12:11:29 AM

Alex you were right, with big buffer I have these results:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11628116        cycles for RtlZeroMemory
15883968        cycles for FrkTons
10996290        cycles for rep stosd
15473203        cycles for movdqa
15480211        cycles for movaps
15471071        cycles for FrkTons New
15477872        cycles for movups
15445525        cycles for movupd

8082000 cycles for MOVNTDQ

10999930        cycles for RtlZeroMemory
15870714        cycles for FrkTons
11012185        cycles for rep stosd
15418317        cycles for movdqa
15427633        cycles for movaps
15416041        cycles for FrkTons New
15418995        cycles for movups
15415995        cycles for movupd

8162844 cycles for MOVNTDQ


--- ok ---

So with a big buffer rep stosd is faster than the ordinary SSE2 stores; only MOVNTDQ beats it.
I reduced the loop count of the test from 1 million to 1,000
to make it shorter.
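
To put the cycle counts in perspective, they can be converted to cycles per byte and bytes per second. The sketch below assumes that each reported figure is the cost of one pass over 16,000,000 bytes (the size passed to RtlZeroMemory above) and that the counter ticks at the nominal 2.40 GHz; both are assumptions, and the function names are made up.

```c
#include <assert.h>

/* Cycles spent per byte cleared. */
static double cycles_per_byte(double cycles, double bytes)
{
    assert(cycles > 0.0 && bytes > 0.0);
    return cycles / bytes;
}

/* Bytes cleared per second at a given clock rate in Hz. */
static double bytes_per_second(double cycles, double bytes, double clock_hz)
{
    assert(cycles > 0.0 && bytes > 0.0 && clock_hz > 0.0);
    return bytes * clock_hz / cycles;
}
```

Under those assumptions, the first run gives about 0.69 cycles per byte for rep stosd (10,996,290 cycles) and about 0.51 for MOVNTDQ (8,082,000 cycles), which is roughly 3.5 GB/s and 4.75 GB/s.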

Frank

clive · September 04, 2010, 12:38:21 AM

Absent a newer build here's the result from the last one

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6354    cycles for RtlZeroMemory
10341   cycles for FrkTons
6335    cycles for rep stosd
4143    cycles for movdqa
4108    cycles for movaps
1667    cycles for FrkTons New
8220    cycles for movups
8340    cycles for movupd

6296    cycles for RtlZeroMemory
10333   cycles for FrkTons
6265    cycles for rep stosd
4117    cycles for movdqa
4153    cycles for movaps
1675    cycles for FrkTons New
8227    cycles for movups
8232    cycles for movupd

Core Solo

Genuine Intel(R) CPU           T1350  @ 1.86GHz (SSE3)
2319    cycles for RtlZeroMemory
4072    cycles for FrkTons
2305    cycles for rep stosd
2039    cycles for movdqa
2039    cycles for movaps
2155    cycles for FrkTons New
6098    cycles for movups
6088    cycles for movupd

2315    cycles for RtlZeroMemory
4082    cycles for FrkTons
2296    cycles for rep stosd
2038    cycles for movdqa
2038    cycles for movaps
2164    cycles for FrkTons New
6087    cycles for movups
6095    cycles for movupd

frktons · September 04, 2010, 11:38:30 AM

Quote from: clive on September 04, 2010, 12:38:21 AM
Absent a newer build here's the result from the last one

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6354    cycles for RtlZeroMemory
10341   cycles for FrkTons
6335    cycles for rep stosd
4143    cycles for movdqa
4108    cycles for movaps
1667    cycles for FrkTons New
8220    cycles for movups
8340    cycles for movupd

6296    cycles for RtlZeroMemory
10333   cycles for FrkTons
6265    cycles for rep stosd
4117    cycles for movdqa
4153    cycles for movaps
1675    cycles for FrkTons New
8227    cycles for movups
8232    cycles for movupd

Core Solo

Genuine Intel(R) CPU           T1350  @ 1.86GHz (SSE3)
2319    cycles for RtlZeroMemory
4072    cycles for FrkTons
2305    cycles for rep stosd
2039    cycles for movdqa
2039    cycles for movaps
2155    cycles for FrkTons New
6098    cycles for movups
6088    cycles for movupd

2315    cycles for RtlZeroMemory
4082    cycles for FrkTons
2296    cycles for rep stosd
2038    cycles for movdqa
2038    cycles for movaps
2164    cycles for FrkTons New
6087    cycles for movups
6095    cycles for movupd

Whoops, the Atom really likes working with many XMM registers at a time.
You are right, clive, I didn't post the new build, so here it is. :U

frktons · September 04, 2010, 02:57:25 PM

And for readability, here is a version that formats the elapsed
CPU cycle counts with thousands separators.
This version tests filling a 16 MB buffer.

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11.967.493      cycles for RtlZeroMemory
15.682.875      cycles for FrkTons
10.971.464      cycles for rep stosd
15.418.911      cycles for movdqa
15.435.221      cycles for movaps
15.409.998      cycles for FrkTons New
15.405.469      cycles for movups
15.518.687      cycles for movupd
8.056.812       cycles for MOVNTDQ

11.051.772      cycles for RtlZeroMemory
15.535.943      cycles for FrkTons
10.997.179      cycles for rep stosd
15.467.940      cycles for movdqa
15.457.092      cycles for movaps
15.485.439      cycles for FrkTons New
15.514.719      cycles for movups
15.513.319      cycles for movupd
8.053.411       cycles for MOVNTDQ


--- ok ---

Attached is the "improved version". :P

In my humble n00b-ist opinion, once we use REP STOSQ on a native
64-bit OS on x64 machines,
it is going to beat everything else. I'm not sure about MOVNTDQ, though.
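
For anyone who wants to try REP STOSQ without a 64-bit assembler at hand, here is a small sketch using GCC/Clang extended inline asm. It only builds for x86-64 targets, and the function name `clear_rep_stosq` is made up. Whether it beats MOVNTDQ is exactly what needs measuring; the sketch makes no claim about that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero `len` bytes at dst with REP STOSQ.
   len must be a multiple of 8. GCC/Clang, x86-64 only. */
static void clear_rep_stosq(void *dst, size_t len)
{
    assert(len % 8 == 0);
    size_t qwords = len / 8;                     /* number of 8-byte stores */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(qwords)   /* RDI = destination, RCX = count */
                     : "a"((uint64_t)0)          /* RAX = value to store */
                     : "memory");
}
```

The direction flag is assumed clear, which the usual x86-64 calling conventions guarantee on function entry, so the stores run upward from dst.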

Frank

Rockoon · September 04, 2010, 03:11:41 PM

AMD Phenom(tm) II X6 1055T Processor (SSE3)
10.598.549 cycles for RtlZeroMemory
11.260.417 cycles for FrkTons
10.182.613 cycles for rep stosd
10.131.907 cycles for movdqa
10.115.035 cycles for movaps
10.214.832 cycles for FrkTons New
9.915.582 cycles for movups
10.188.273 cycles for movupd
6.888.405 cycles for MOVNTDQ

10.199.509 cycles for RtlZeroMemory
11.050.508 cycles for FrkTons
10.192.022 cycles for rep stosd
10.131.227 cycles for movdqa
10.113.104 cycles for movaps
10.217.227 cycles for FrkTons New
9.952.748 cycles for movups
10.184.520 cycles for movupd
6.700.808 cycles for MOVNTDQ

jj2007 · September 04, 2010, 04:51:36 PM

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15.891.933      cycles for RtlZeroMemory
23.350.696      cycles for FrkTons
15.800.872      cycles for rep stosd
23.825.786      cycles for movdqa
23.886.914      cycles for movaps
23.872.424      cycles for FrkTons New
23.760.942      cycles for movups
23.730.159      cycles for movupd
9.818.186       cycles for MOVNTDQ

frktons · September 04, 2010, 05:05:25 PM

Quote from: jj2007 on September 04, 2010, 04:51:36 PM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15.891.933      cycles for RtlZeroMemory
23.350.696      cycles for FrkTons
15.800.872      cycles for rep stosd
23.825.786      cycles for movdqa
23.886.914      cycles for movaps
23.872.424      cycles for FrkTons New
23.760.942      cycles for movups
23.730.159      cycles for movupd
9.818.186       cycles for MOVNTDQ

Well, for big buffers, like Alex said, MOVNTDQ is faster than anything else
on every machine tested so far.
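
For reference, a clear loop built on MOVNTDQ looks roughly like this in C with SSE2 intrinsics (`_mm_stream_si128` compiles to MOVNTDQ). This is a generic sketch and not the routine from the attached test program; the name `clear_movntdq` and the 4x unrolling are my own choices. The usual explanation for the speed-up on big buffers is that non-temporal stores bypass the cache, so cache lines are not read in before being overwritten.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Zero `len` bytes at dst with non-temporal 16-byte stores.
   dst must be 16-byte aligned and len a multiple of 64. */
static void clear_movntdq(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(len % 64 == 0);

    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    __m128i *end = p + len / 16;

    while (p < end) {                    /* 64 bytes per iteration */
        _mm_stream_si128(p + 0, zero);
        _mm_stream_si128(p + 1, zero);
        _mm_stream_si128(p + 2, zero);
        _mm_stream_si128(p + 3, zero);
        p += 4;
    }
    _mm_sfence();                        /* order the streaming stores before later stores */
}
```

For small buffers that are used again right away, the cache bypass can work against you, which fits the earlier small-buffer results where the ordinary stores did well.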

Clive's Atom was really impressive when several XMM registers
are used at once. I hope he'll post his results for this test as well.

clive · September 05, 2010, 01:30:38 PM

Ok, this is from the original Acer Aspire One, I should try it on the newer one with the N450 CPU.

From ClearBufferNew3

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23855499        cycles for RtlZeroMemory
23614698        cycles for FrkTons
23448786        cycles for rep stosd
23485010        cycles for movdqa
23496468        cycles for movaps
23578524        cycles for FrkTons New
23528033        cycles for movups
23437548        cycles for movupd

9166934 cycles for MOVNTDQ

23551485        cycles for RtlZeroMemory
23568079        cycles for FrkTons
23531873        cycles for rep stosd
23494767        cycles for movdqa
23480850        cycles for movaps
23560303        cycles for FrkTons New
23521236        cycles for movups
23485775        cycles for movupd

9156989 cycles for MOVNTDQ

From ClearBufferNew4

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23.621.570      cycles for RtlZeroMemory
23.548.342      cycles for FrkTons
23.633.944      cycles for rep stosd
23.482.208      cycles for movdqa
23.584.646      cycles for movaps
23.504.980      cycles for FrkTons New
23.561.681      cycles for movups
23.543.870      cycles for movupd
9.184.061       cycles for MOVNTDQ

23.566.672      cycles for RtlZeroMemory
23.507.384      cycles for FrkTons
23.601.428      cycles for rep stosd
23.489.780      cycles for movdqa
23.512.724      cycles for movaps
23.516.591      cycles for FrkTons New
23.549.450      cycles for movups
23.596.780      cycles for movupd
9.156.032       cycles for MOVNTDQ

frktons · September 05, 2010, 03:41:53 PM

The Atom is a surprise again :dazzled:
Its results are always extreme, for better or for worse.

Thanks clive

Frank

Antariy · September 05, 2010, 09:41:24 PM

This is an attempt to make an x64 app on a machine that runs x64 code very badly :P
Please post results; Frank has been waiting for this for a very long time.

Frank, this is compiled with GoAsm.

Alex

frktons · September 05, 2010, 09:47:46 PM

Hi Alex, thanks for doing this test.
The results on my machine are:

---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)	
---------------------------
OK   
---------------------------
---------------------------
REP STOSQ
---------------------------
Clocks: 1.977.888.757

(Press [Ctrl]+[C] for copying to clipboard)	
---------------------------
OK   
---------------------------

What buffer size did you use, and how many loops did you run?
Maybe 1 million loops and 16 MB?

Well, these results seem to confirm my suspicion that on a 64-bit machine
REP/STOSQ is the fastest buffer-filling instruction for the time being :U

May I ask how you compiled it?
I can download GoAsm and try some experiments.

Frank

Antariy · September 05, 2010, 09:52:30 PM

Quote from: frktons on September 05, 2010, 09:47:46 PM
Hi Alex, thanks for doing this test.
The results on my machine are:

---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)
---------------------------
OK
---------------------------

What buffer size did you use, and how many loops did you run?
Maybe 1 million loops and 16 MB?

Frank

Frank, you forgot to paste the timings for REP STOSQ; they are in the second message box.

I used a 32 MB buffer and 100 loops of the test.

Alex
P.S. What are the timings for STOSQ?

frktons · September 05, 2010, 09:57:05 PM

I posted them in the previous post, Alex. Have a look.

REP/STOSQ looks a lot faster than MOVNTPD,
at least on my machine.

Thanks again for doing the test. :clap:

Frank
