The fastest way to clear a buffer

Started by frktons, August 24, 2010, 08:47:34 PM

Antariy

Quote from: frktons on September 05, 2010, 09:57:05 PM
I posted in the previous post Alex. Have a look.

REP/STOSQ looks a lot faster than MOVNTPD
at least on my machine.

Thanks again for doing the test.  :clap:

Frank

Yes, this behaviour is not surprising (that STOSQ is faster).

Initially you didn't post the results for REP STOSQ :P
You added them later :)



Alex

frktons

Quote from: Antariy on September 05, 2010, 10:04:49 PM

Yes, this behaviour is not surprising (that STOSQ is faster).

Initially you didn't post the results for REP STOSQ :P
You added them later :)
Alex


Yes Alex, a MessageBox appeared and I didn't know that a second one would follow,
so I posted only the first result.  :P

On my CPU REP/STOSQ is about 4 times faster than MOVNTPD, and in some tests even more.
This was just an idea I had: that native x64 code moving data through the 64-bit RXX
registers is faster than SSE2 for simple data moves. These tests seem to confirm that idea, thanks to you.

Frank
Mind is like a parachute. You know what to do in order to use it :-)

Antariy

Frank, I compile it like this:


goasm /x64 asmfilename.asm
golink asmfilename.obj


Nothing more.

Run each tool with /? to see the full help for its parameters.

EDITED: Frank, I forgot to add this:
To link, you need to add the names of the DLLs whose APIs are used to the command line, so:

golink asmfilename.obj kernel32.dll user32.dll ... etc




Alex

frktons

Quote from: Antariy on September 05, 2010, 10:22:46 PM
Frank, I compile it like this:


goasm /x64 asmfilename.asm
golink asmfilename.obj


Nothing more.

Run each tool with /? to see the full help for its parameters.

Alex

Very good, thanks.

As I have some spare time I'll do some experiments with 64-bit code,
I think I'll enjoy it.

Frank

Antariy

Frank, test the app from this post. I commit the memory before the test; this can (must) give better results in the tests.

Don't forget to post the timings :)
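
A side note on why committing first matters: if the OS only backs the pages on first write, the timed loop also pays for the page faults. The sketch below is a portable C stand-in, not the code from the attached test app. It assumes "commit" means making sure every page is really backed before the clock starts; `pretouch_pages` and the 4096-byte page size are hypothetical names and values chosen for illustration.

```c
#include <stddef.h>

/* Writes one byte into every page of `buf` so the OS backs the whole
   buffer with real memory before the timed run starts.
   `page_size` is an assumption (4096 is typical on x86).
   Returns the number of pages touched. */
size_t pretouch_pages(volatile unsigned char *buf, size_t size, size_t page_size)
{
    size_t touched = 0;
    for (size_t off = 0; off < size; off += page_size) {
        buf[off] = 0;   /* the first write to a page forces it to be backed */
        touched++;
    }
    return touched;
}
```

Call it once on the buffer right after allocating it, and only then start reading the clock counter around the clearing routine.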



Alex

frktons

The test produces these results on my CPU:

Clearing done
183292241 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
921076349 clocks for a 33554432 bytes buffer with using MOVNTDQ



REP/STOSQ is getting faster this way.

With these big numbers a thousands separator would help a lot:

Clearing done
183.292.241 clocks for a 33.554.432 bytes buffer with using REP STOSQ
Clearing done
921.076.349 clocks for a 33.554.432 bytes buffer with using MOVNTDQ
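
For reference, grouping the digits like this takes only a few lines. The sketch below is in C rather than assembly and is not taken from the test app; `format_thousands` is a hypothetical helper name, and the separator is passed in so '.' or ',' can be used.

```c
#include <stdio.h>

/* Hypothetical helper: writes `value` into `out` with `sep` between groups
   of three digits, counted from the right.
   `out` must hold at least 27 bytes (20 digits + 6 separators + NUL). */
char *format_thousands(unsigned long long value, char sep, char *out)
{
    char digits[21];
    int len = snprintf(digits, sizeof digits, "%llu", value);
    int o = 0;

    for (int i = 0; i < len; i++) {
        int remaining = len - 1 - i;        /* digits still to be written */
        out[o++] = digits[i];
        if (remaining > 0 && remaining % 3 == 0)
            out[o++] = sep;                 /* a full group of three follows */
    }
    out[o] = '\0';
    return out;
}
```

With a 27-byte buffer `buf`, `format_thousands(183292241ULL, '.', buf)` gives the `183.292.241` form shown above.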

Frank

Antariy

Oh...

At the last moment I added cpuid to the test, but didn't do everything that it needs... That's what hurrying does...


Frank, please test the new one attached to this post. The previous test is NOT right.



Alex

frktons


Clearing done
23712588 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
20154312 clocks for a 33554432 bytes buffer with using MOVNTDQ


That's it Alex, MOVNTDQ is still faster than REP STOSQ.

Frank


zemtex

I would use a macro to set spaces in a buffer. A macro can be customized for peak performance, and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to tell it whether the edi register needs to be preserved or not; that would also save instructions.

A macro can let you choose between different methods based on how fast each method is for the different data sizes: one method for sizes x-y, another method for sizes v-t, etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.

Take advantage of macros  :U
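
A rough sketch of the size-dispatch idea, written as a C macro rather than an assembler macro, so it only illustrates the principle. `FILL_BUFFER`, `fill_small` and the 16-byte cut-off are hypothetical; a real cut-off would have to come from timing on the target machine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off between "small" and "large" buffers. */
#define FILL_SMALL_LIMIT 16

/* Small sizes: plain byte loop, no library call. */
static inline void fill_small(unsigned char *p, size_t n, unsigned char v)
{
    while (n--)
        *p++ = v;
}

/* Picks a method by size. When `n` is a compile-time constant the
   compiler can discard the branches that are never taken. */
#define FILL_BUFFER(buf, n, v)                                            \
    do {                                                                  \
        if ((n) == 1)                                                     \
            *(unsigned char *)(buf) = (unsigned char)(v);                 \
        else if ((n) <= FILL_SMALL_LIMIT)                                 \
            fill_small((unsigned char *)(buf), (n), (unsigned char)(v));  \
        else                                                              \
            memset((buf), (v), (n));                                      \
    } while (0)
```

For example, `FILL_BUFFER(line, 1, ' ')` reduces to a single byte store, while `FILL_BUFFER(screen, 4000, ' ')` goes through `memset`. In an assembler macro the same choice would be made with conditional assembly on the size argument.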
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely necessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

frktons

Quote from: zemtex on September 26, 2010, 06:39:05 PM
I would use a macro to set spaces in a buffer. A macro can be customized for peak performance, and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to tell it whether the edi register needs to be preserved or not; that would also save instructions.

A macro can let you choose between different methods based on how fast each method is for the different data sizes: one method for sizes x-y, another method for sizes v-t, etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.

Take advantage of macros  :U

Feel free to post any working example you like.  :U

Frank

xanatose

On my laptop (MacBook Pro, running Windows 7 64-bit) I get these results:

For ClearBufferNew4.exe:

Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz (SSE4)
16.270.158      cycles for RtlZeroMemory
22.256.374      cycles for FrkTons
17.003.011      cycles for rep stosd
22.407.395      cycles for movdqa
21.586.957      cycles for movaps
20.894.627      cycles for FrkTons New
21.574.685      cycles for movups
21.463.688      cycles for movupd
8.449.814       cycles for MOVNTDQ


for clearbufx64_3.exe:

Clearing done
65323406 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
33619797 clocks for a 33554432 bytes buffer with using MOVNTDQ


I guess what is faster will depend on the machine.
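
To compare runs from different machines it can help to turn each line into bytes cleared per clock. A minimal C sketch follows; `bytes_per_clock` is a hypothetical helper and is not part of either test app.

```c
/* Bytes cleared per clock for one reported run. */
double bytes_per_clock(unsigned long long bytes, unsigned long long clocks)
{
    return (double)bytes / (double)clocks;
}
```

Fed with the clearbufx64_3.exe figures above, it gives roughly 0.5 bytes per clock for REP STOSQ and roughly 1.0 for MOVNTDQ on this laptop. Frank's corrected run works out to roughly 1.4 and 1.7.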

Antariy

Quote from: xanatose on October 01, 2010, 01:17:55 AM
On my laptop (MacBook Pro, running Windows 7 64-bit) I get these results:
.........
I guess what is faster will depend on the machine.

Hi!

Thanks for testing!

Just note that ClearBufferNew4.exe is a 32-bit app and uses the 32-bit REP STOSD (which is probably cached, to make effective transactions on the system bus), while clearbufx64_3.exe is a 64-bit app and uses the 64-bit REP STOSQ, which is probably not cached while the write is in progress, because its timings are very close to those of the same SSE2 algo, which writes 128 bits non-cached.



Alex