Hi all.
I'm wondering how to clear a buffer in the fastest possible way.
I have an area of 8,000 bytes and I want to fill it with spaces after having
used it for other purposes.
I came up with this solution and I'm wondering if there are better and
faster ways to do it.
;----------------------------------------------------------------------
; Fast way for clearing [putting all spaces into] a
; structure CHAR_INFO totalling 8000 bytes.
;----------------------------------------------------------------------
; Author: frktons @ MASM32 forum
; Date: 24/aug/2010.
;----------------------------------------------------------------------
include \masm32\include\masm32rt.inc
ClearBuffer PROTO :DWORD
;----------------------------------------------------------------------
.data?
buf2clear CHAR_INFO 2000 dup (<>)
rHnd HANDLE ?
howmany dd ?
buffer INPUT_RECORD <>
.code
start:
Main PROC
INVOKE GetStdHandle, STD_INPUT_HANDLE
mov rHnd,eax
INVOKE ClearBuffer, ADDR buf2clear
print "Clearing done",13,10,13,10
print "Press any key to close...",13,10
CALL AnyKey
finish: INVOKE ExitProcess,0
ret
Main ENDP
; -------------------------------------------------------------------------
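ClearBuffer PROC uses ebx AddrBuffer:DWORD ; ebx must be preserved under the Win32 convention
mov eax, AddrBuffer
mov ecx, 1000 ; 1000 iterations x 8 bytes = 8000 bytes
mov bl, 32
mov bh, bl ; bx = 2020h
bswap ebx ; move the two spaces into the upper half of ebx
mov bl, 32
mov bh, bl ; ebx = 20202020h, four spaces
cycle:
mov [eax], ebx
add eax, 4
mov [eax], ebx
add eax, 4
dec ecx
jnz cycle
ret
ClearBuffer ENDP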
; -------------------------------------------------------------------------
;Returns: key code in buffer.KeyEvent.wVirtualKeyCode WORD size
; -------------------------------------------------------------------------
AnyKey PROC
again:
INVOKE ReadConsoleInput,rHnd,offset buffer,1,offset howmany
cmp buffer.EventType,KEY_EVENT
jnz again
cmp buffer.KeyEvent.bKeyDown,0
jz again
ret
AnyKey ENDP
; -------------------------------------------------------------------------
end start
Any improvement possible?
Thanks
just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that
.DATA
ValueOK db "Memory zeroed out.",0
Sample db "BOX",0
Storage db "Co-ordinates of the Ark of the Covenant are...",0
.data?
Storage1 db 256 dup(?)
.CODE
start:
invoke RtlZeroMemory, ADDR Storage, sizeof Storage ; in kernel32.inc
Well, he wants spaces, not zeroes, but a rep stosd is most probably the fastest way to fill an 8k buffer with spaces.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1252 cycles for RtlZeroMemory
1231 cycles for rep stosd.
Hi, Frank!
What if you change the code to this:
ClearBuffer PROC uses ebx AddrBuffer:DWORD
mov eax, AddrBuffer
mov ecx, 1000 ; 1000 iterations x 8 bytes = 8000 bytes
mov ebx,20202020h ; fill ebx with four spaces in one instruction
cycle:
mov [eax], ebx
mov [eax+4], ebx
add eax, 8
dec ecx
jnz cycle
ret
ClearBuffer ENDP
Does this work?
Or this:
ClearBuffer PROC AddrBuffer:DWORD
mov edx,edi ; save edi in edx
mov ecx, 2000 ; this must be 2000 dwords (8000 bytes). Thanks to Jochen!
mov edi, AddrBuffer
mov eax,20202020h
rep stosd
mov edi,edx ; restore edi
ret
ClearBuffer ENDP
Test this.
Alex
EDITED. Jochen tried to confuse me by pointing out a bug. So, Jochen, this is not my code, and it is hard to concentrate when posting online in real time. I am only making a suggestion. Frank can find the solution himself. I only show the way (there is no need to fill ebx piece by piece; shorter code does the job).
Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1261 cycles for RtlZeroMemory
1233 cycles for rep stosd
1014 cycles for movdqa
Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1261 cycles for RtlZeroMemory
1233 cycles for rep stosd
1014 cycles for movdqa
Jochen, don't confuse Frank with your experience :) Everybody knows that you like SSE2 very much. What about movaps?
Alex
Quote from: Antariy on August 24, 2010, 10:03:35 PM
What about movaps?
Identical. One byte shorter, of course.
Quote from: dedndave on August 24, 2010, 08:55:47 PM
just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that
Hi Dave.
The name RtlZeroMemory suggests that this function clears an area of memory to zero.
It can be useful in other situations; here I need to clear to spaces [ASCII 32].
Quote from: Antariy on August 24, 2010, 09:56:01 PM
Hi, Frank!
What if you change the code to this:
ClearBuffer PROC uses ebx AddrBuffer:DWORD
mov eax, AddrBuffer
mov ecx, 1000 ; 1000 iterations x 8 bytes = 8000 bytes
mov ebx,20202020h ; fill ebx with four spaces in one instruction
cycle:
mov [eax], ebx
mov [eax+4], ebx
add eax, 8
dec ecx
jnz cycle
ret
ClearBuffer ENDP
Does this work?
Or this:
ClearBuffer PROC AddrBuffer:DWORD
mov edx,edi ; save edi in edx
mov ecx, 2000 ; this must be 2000 dwords (8000 bytes). Thanks to Jochen!
mov edi, AddrBuffer
mov eax,20202020h
rep stosd
mov edi,edx ; restore edi
ret
ClearBuffer ENDP
Test this.
Alex
EDITED. Jochen tried to confuse me by pointing out a bug. So, Jochen, this is not my code, and it is hard to concentrate when posting online in real time. I am only making a suggestion. Frank can find the solution himself. I only show the way (there is no need to fill ebx piece by piece; shorter code does the job).
Thanks Alex. The first solution should gain some cycles compared to mine;
the second one using stosd should be faster according to your comments.
I have to test it and to understand how stosd works, it is the first time
I see it :P
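For anyone else meeting stosd for the first time, here is a rough sketch of what rep stosd does, written out as an ordinary loop. Dest is only a placeholder name for the buffer, and the sketch assumes the direction flag is clear, which is the normal state under Windows.

```asm
; What "rep stosd" does, spelled out as a plain loop (sketch only).
; Assumes DF = 0 and that Dest is a placeholder for a buffer
; of at least 8000 bytes.
    mov edi, offset Dest    ; stosd always writes through edi
    mov eax, 20202020h      ; stosd always stores eax: four spaces
    mov ecx, 2000           ; rep counts dwords in ecx: 2000 * 4 = 8000 bytes
    jecxz done              ; rep does nothing at all when ecx = 0
next:
    mov [edi], eax          ; store one dword...
    add edi, 4              ; ...advance the destination...
    dec ecx                 ; ...and count down (rep itself leaves the flags alone)
    jnz next
done:
; Everything from "jecxz" to "done" collapses into one line: rep stosd
```

After either version, edi points just past the filled area and ecx is zero.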
Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1261 cycles for RtlZeroMemory
1233 cycles for rep stosd
1014 cycles for movdqa
Hi Jochen,
if you post the code I can have a look at it.
I'm not scared of SSE2/3/4 but I don't know them so it could
be an occasion to get INTEL manuals working a little. :lol
And last but not least, how does my version perform, compared to:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1261 cycles for RtlZeroMemory
1233 cycles for rep stosd
1014 cycles for movdqa
How much faster are these methods compared to the first one I posted?
Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1252 cycles for RtlZeroMemory
2024 cycles for FrkTons
1233 cycles for rep stosd
1014 cycles for movdqa
1013 cycles for movaps
Quote from: jj2007 on August 24, 2010, 11:12:56 PM
Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1252 cycles for RtlZeroMemory
2024 cycles for FrkTons
1233 cycles for rep stosd
1014 cycles for movdqa
1013 cycles for movaps
Thanks Jochen,
now I've an idea of the performance gap among the various methods.
Time to study them a little, tomorrow and the days to come. :U
On my pc I've these results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1058 cycles for RtlZeroMemory
2022 cycles for FrkTons
1056 cycles for rep stosd
532 cycles for movdqa
531 cycles for movaps
1056 cycles for RtlZeroMemory
2318 cycles for FrkTons
1224 cycles for rep stosd
616 cycles for movdqa
613 cycles for movaps
--- ok ---
Interesting enough that RtlZeroMemory, probably a C/C++ function, is
2:1 faster than the handwritten elementary assembly version I coded. :P
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. However, for under 512 bytes the fastest was based off the memfill procedure in masm32, I believe.
Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. However, for under 512 bytes the fastest was based off the memfill procedure in masm32, I believe.
Last post of that thread:
Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster
1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).
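To make the alignment remark concrete, here is a minimal sketch of an SSE2 fill, not the code that produced the timings above. Dest and FillSSE2 are made-up names, and it assumes .686 and .xmm have been enabled after the include and that align 16 is honoured in the .data? section.

```asm
; Sketch: fill 8000 bytes with spaces using aligned 16-byte stores.
.data?
align 16                      ; movdqa faults on a misaligned address
Dest db 8000 dup (?)          ; placeholder buffer, 8000 = 500 * 16
.code
FillSSE2 PROC
    mov eax, 20202020h
    movd xmm0, eax            ; low dword of xmm0 = four spaces
    pshufd xmm0, xmm0, 0      ; copy that dword into all four lanes
    mov edx, offset Dest
    lea ecx, [edx+8000]       ; ecx = first address past the buffer
@@:
    movdqa [edx], xmm0        ; 16 spaces per store
    add edx, 16
    cmp edx, ecx
    jb @B                     ; unsigned compare, these are addresses
    ret
FillSSE2 ENDP
```

If the buffer cannot be aligned, movdqu works on any address, at a noticeable cost on CPUs of this generation.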
Quote from: jj2007 on August 25, 2010, 06:30:46 AM
Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. However, for under 512 bytes the fastest was based off the memfill procedure in masm32, I believe.
Last post of that thread:
Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster
1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).
I have a question about RtlZeroMemory: could we call it in some way so that this function
fills the buffer with a character of our choice, or does it just zero the area? Does it take a fill parameter?
By the way, the SSE2 solution you posted looks much faster than it, so why not use it on modern
machines? :P
Thanks
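RtlZeroMemory itself only takes an address and a length, so there is nowhere to pass a fill character. A hand-rolled routine with a fill parameter is short, though. The sketch below is only an illustration: FillBytes and its parameter names are made up, and it assumes the direction flag is clear.

```asm
; Sketch: fill nBytes bytes at pDest with the low byte of fillByte.
FillBytes PROTO :DWORD,:DWORD,:DWORD

FillBytes PROC uses edi pDest:DWORD, nBytes:DWORD, fillByte:DWORD
    movzx eax, byte ptr fillByte
    imul eax, eax, 01010101h  ; replicate the byte: 20h -> 20202020h
    mov edi, pDest
    mov ecx, nBytes
    mov edx, ecx              ; keep the byte count for the tail
    shr ecx, 2                ; number of whole dwords
    rep stosd
    mov ecx, edx
    and ecx, 3                ; 0 to 3 leftover bytes
    rep stosb
    ret
FillBytes ENDP
```

For the buffer in the first post the call would be: invoke FillBytes, ADDR buf2clear, 8000, 32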
Frank,
have a play with REP STOSD, apart from SSE you will struggle to do much better.
Quote from: hutch-- on August 25, 2010, 08:52:18 AM
Frank,
have a play with REP STOSD, apart from SSE you will struggle to do much better.
Certainly I'll play a little with it, and with some SSE as well after a while. My machine is able to
do so many things I don't even suspect :P
AMD Phenom(tm) II X6 1055T Processor (SSE3)
557 cycles for RtlZeroMemory
2012 cycles for FrkTons
549 cycles for rep stosd
1509 cycles for movdqa
1509 cycles for movaps
556 cycles for RtlZeroMemory
3014 cycles for FrkTons
549 cycles for rep stosd
1016 cycles for movdqa
1015 cycles for movaps
Really surprising, Rockoon. There seem to be huge differences in the way rep stosd is implemented.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
2515 cycles for RtlZeroMemory
4300 cycles for FrkTons
2486 cycles for rep stosd
2491 cycles for movdqa
2387 cycles for movaps
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
1055 cycles for RtlZeroMemory
2018 cycles for FrkTons
1047 cycles for rep stosd
531 cycles for movdqa
531 cycles for movaps
1055 cycles for RtlZeroMemory
2026 cycles for FrkTons
1048 cycles for rep stosd
521 cycles for movdqa
519 cycles for movaps
--- ok ---
REP STOSD is simple enough
i have to ask, though
why do you want to clear the char buffer ? - lol
won't it be filled in by the next read/fill operation ?
Quote from: dedndave on August 25, 2010, 03:36:27 PM
REP STOSD is simple enough
i have to ask, though
why do you want to clear the char buffer ? - lol
won't it be filled in by the next read/fill operation ?
Yes sir, it'll be filled with the next operation, but not my curiosity :P
And while we are here, I tried to use some SSE mnemonics to do something
different, because the ability to use 16 bytes register allures me a lot, but those
nasty little endians make me crazy:
;----------------------------------------------------------------------
; Fast way for reversing a 16 bytes string with SSE instructions.
;----------------------------------------------------------------------
; Author: frktons @ MASM32 forum
; Date: 25/aug/2010.
;----------------------------------------------------------------------
include \masm32\include\masm32rt.inc
.686
.xmm
;----------------------------------------------------------------------
.data
align 16
str1 db "0123456789ABCDEF",0 ; original string
ptr_str1 dd str1 ; pointer to the string
align 16
str2 db 16 dup (" "),0 ; reversed string (16 bytes, so the movdqa store stays inside it)
ptr_str2 dd str2 ; pointer to reversed string
imm8 db 27 ; bit pattern 00011011 used by pshufd to reverse
; the order of the 4 dwords of an xmm register
;----------------------------------------------------------------------
.data?
rHnd HANDLE ?
howmany dd ?
buffer INPUT_RECORD <>
.code
start:
Main PROC
INVOKE GetStdHandle, STD_INPUT_HANDLE
mov rHnd,eax
print "original string: "
print ptr_str1,13,10,13,10
CALL rev_sse2
print "reversed string: "
print ptr_str2,13,10,13,10
CALL AnyKey
finish: INVOKE ExitProcess,0
ret
Main ENDP
; -------------------------------------------------------------------------
rev_sse2 PROC
mov eax, ptr_str1
mov ebx, ptr_str2
movdqa xmm0, [eax]
pshufd xmm1, xmm0, 27
movdqa [ebx], xmm1
ret
rev_sse2 ENDP
; -------------------------------------------------------------------------
;Returns: key code in buffer.KeyEvent.wVirtualKeyCode WORD size
; -------------------------------------------------------------------------
AnyKey PROC
again:
INVOKE ReadConsoleInput,rHnd,offset buffer,1,offset howmany
cmp buffer.EventType,KEY_EVENT
jnz again
cmp buffer.KeyEvent.bKeyDown,0
jz again
ret
AnyKey ENDP
; -------------------------------------------------------------------------
end start
This does not give me what I want, the reversed string, but something a bit
different:
original string: 0123456789ABCDEF
reversed string: CDEF89AB45670123
aren't those little endians nasty enough?
Or is it my n00b-iness that is big [endian] enough? :lol
i don't think SSE give you a way to reverse bytes in a dword
the BSWAP instruction does that, though
;EAX = 12345678h
bswap eax
;EAX = 78563412h
if you want to swap nybbles, that's another story
a while back, we were playing with reversing all the bits in a dword register
there was a rather interesting algo for that
Quote from: dedndave on August 25, 2010, 03:46:31 PM
i don't think SSE give you a way to reverse bytes in a dword
the BSWAP instruction does that, though
Yes Master, I remember the old lesson about bswap that you and
Jochen gave me some time ago. I was just experimenting with this opportunity
of SSE mnemonics. Maybe there is even a way to reverse the whole thing with SSE
but I actually don't know ::)
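For what it's worth, plain SSE2 can finish the job with a few more shuffles and shifts. The sketch below is one possible way, written as a replacement for rev_sse2 in the program above; it reuses ptr_str1 and ptr_str2 from that program and assumes both strings are 16-byte aligned and str2 has room for 16 bytes.

```asm
; Sketch: reverse 16 bytes with SSE2 only.
rev_sse2 PROC
    mov eax, ptr_str1
    mov edx, ptr_str2
    movdqa xmm0, [eax]          ; "0123456789ABCDEF"
    pshufd xmm1, xmm0, 27       ; reverse the 4 dwords: "CDEF89AB45670123"
    pshuflw xmm1, xmm1, 0B1h    ; swap the 2 words inside each low dword
    pshufhw xmm1, xmm1, 0B1h    ; same for the high dwords: "EFCDAB8967452301"
    movdqa xmm0, xmm1
    psllw xmm0, 8               ; low byte of each word moves up
    psrlw xmm1, 8               ; high byte of each word moves down
    por xmm1, xmm0              ; bytes swapped in every word: "FEDCBA9876543210"
    movdqa [edx], xmm1
    ret
rev_sse2 ENDP
```

The idea is to reverse at three levels in turn: dwords in the register, words in each dword, bytes in each word.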
You can reverse 16 bytes with a single instruction called pshufb, but it's SSSE3 (Supplemental SSE3), one step beyond plain SSE3.
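A minimal sketch of the pshufb approach, for illustration only. pshufb was introduced with SSSE3, so this needs a CPU that reports SSSE3 and an assembler that knows the mnemonic (the ML shipped with MASM32 may be too old; a recent JWasm is assumed here). RevMask and rev_ssse3 are made-up names; ptr_str1 and ptr_str2 come from the program above.

```asm
; Sketch: reverse 16 bytes with one pshufb.
.data
align 16
RevMask db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0   ; result byte i = source byte RevMask[i]
.code
rev_ssse3 PROC
    mov eax, ptr_str1
    mov edx, ptr_str2
    movdqa xmm0, [eax]               ; load the 16 source bytes
    pshufb xmm0, oword ptr RevMask   ; pick the bytes in reverse order
    movdqa [edx], xmm0               ; store the reversed string
    ret
rev_ssse3 ENDP
```

Any other byte permutation works the same way, only the mask changes.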
Quote from: jj2007 on August 25, 2010, 05:28:18 PM
You can reverse 16 bytes with a single instruction called pshufb, but it's SSSE3 (Supplemental SSE3), one step beyond plain SSE3.
Thanks Jochen. I'll wait until the next CPU then. :P
I started a new thread in the 64 bit section because
rep stosd was considered the fastest way to initialize a block of
memory in 32 bit assembly.
Now SSE instructions beat it on INTEL machine at least.
It's my opinion that in 64 bit machines, working with 64 bit native operations,
we could get better results than SSE mnemonics just using general 64 bit registers.
To prove it I need the rep stosd version translated into 64 bit assembly
and tested.
Anyone wants to engage?
Quote from: frktons on August 25, 2010, 08:15:47 PM
Anyone wants to engage?
What about you? I'll give you a starting point:
mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd
I can't test it because my OS and CPU are 32 bit. Now don't be shy, just go ahead!
it's not REP STOSQ ???
Quote from: jj2007 on August 25, 2010, 08:30:53 PM
Quote from: frktons on August 25, 2010, 08:15:47 PM
Anyone wants to engage?
What about you? I'll give you a starting point:
mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd
I can't test it because my OS and CPU are 32 bit. Now don't be shy, just go ahead!
Thanks Jochen. I'll gladly try it if you tell me how do I compile it?
Is MASM32 enough or have I to use any other tool?
And a last question:
.686
.xmm
are enough or have I to specify something else?
Quote from: frktons on August 25, 2010, 08:58:46 PM
Is MASM32 enough or have I to use any other tool?
JWasm (http://www.japheth.de/JWasm/Win64_1.html) is the best option, but I can't tell you more since my OS is 32 bit.
Well, I asked in 64 bit subforum because I'm not ready for 64 bit assembling.
My OS is 64 bit and my machine too, but I know too little to do it myself, and
I don't even have a clue on how to use GoAsm or JWasm or ML64. ::)
By the way, instead of RtlZeroMemory it's probably better to use this MACRO from Microsoft
to accomplish the task of filling a block of memory:
void FillMemory(
[out] PVOID Destination,
[in] SIZE_T Length,
[in] BYTE Fill
);
What do you think?
The FillMemory macro calls the RtlFillMemory function.
You will find the following in windows.inc:
FillMemory EQU RtlFillMemory
I would just call RtlFillMemory instead of FillMemory just to keep things straightforward.
RtlZeroMemory fills memory with zeros, RtlFillMemory is for filling memory with other characters.
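As a quick illustration, filling the buffer from the first post with spaces through the API could look like the line below. This is a sketch that assumes kernel32.inc declares RtlFillMemory with three DWORD arguments in the order destination, length, fill byte.

```asm
; Sketch: SIZEOF buf2clear is 2000 CHAR_INFO * 4 bytes = 8000 bytes.
    invoke RtlFillMemory, ADDR buf2clear, SIZEOF buf2clear, 32
```

Only the low byte of the last argument is used as the fill value.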
i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?
Quote from: Greg Lyon on August 25, 2010, 11:29:46 PM
The FillMemory macro calls the RtlFillMemory function.
You will find the following in windows.inc:
FillMemory EQU RtlFillMemory
I would just call RtlFillMemory instead of FillMemory just to keep things straightforward.
RtlZeroMemory fills memory with zeros, RtlFillMemory is for filling memory with other characters.
Thanks Greg, this is what I meant. There is this MACRO from Microsoft that is quite efficient, and
is an implementation of rep stosd.
Quote from: dedndave on August 25, 2010, 11:38:26 PM
i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?
You are right, Master. Only one thing to meditate upon:
REP STOSQ is only implemented inside VISUAL C++, and
Quote
This routine is only available as an intrinsic
And in Assembly as INTEL says:
Quote
In 64-bit mode, the default address size is 64 bits, 32-bit address size is supported using the prefix 67H. Using a REX prefix in the form of REX.W promotes operation on doubleword operand to 64 bits. The promoted no-operand mnemonic is STOSQ. STOSQ (and its explicit operands variant) store a quadword from the RAX register into the destination addressed by RDI or EDI.
That is quite obscure meaning for a "premium n00b" of my level :P
My results with RtlFillMemory added to the testbed:
Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1057 cycles for RtlZeroMemory
2017 cycles for FrkTons
1050 cycles for rep stosd
568 cycles for movdqa
556 cycles for movaps
1098 cycles for RtlFillMemory
1078 cycles for RtlZeroMemory
2047 cycles for FrkTons
1052 cycles for rep stosd
555 cycles for movdqa
548 cycles for movaps
1111 cycles for RtlFillMemory
--- ok ---
In the "sloppy category" my routine beats anybody's else. :P
give yourself some credit Frank - lol
STOSQ stores a qword (64 bit value)
REP STOSQ is probably fast as hell on 64-bit machines
Jochen showed REP STOSD in his 64-bit example code - that was probably just a small oversight
i am sure he meant REP STOSQ
RtlZeroMemory probably preserves EDI (or RDI), but other than that, it is straightforward
you can assume that the direction flag is clear, as it should be - i am sure RtlZeroMemory also makes that assumption
so....
load the value you want repeated into EAX/RAX
load the repeat count into ECX/RCX
load the address into EDI/RDI
then do the REP STOSD or REP STOSQ
it will be quite fast as long as the address is 4-byte-aligned for 32-bit code or 8-byte-aligned for 64-bit code
Quote from: dedndave
i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?
Dave,
I knew someone was going to say something like that. I was only commenting on the use of FillMemory, not on what was fastest to do the job.
you're right, of course, Greg
many of these functions were written for C-programmers :lol
Dave,
So are you saying I don't know how to write ASM code?
no - no - not at all, Greg - lol
you're tops :U
Dave,
No, I'm definitely not tops. Smartass.
Quote from: Greg Lyon on August 26, 2010, 01:48:52 AM
Dave,
No, I'm definitely not tops. Smartass.
heh you must not be aware of Dave's humor, shame.
Also to the OP, if you're just trying to "clear" a buffer to use with an ascii string you can
lea edi,buffer
mov byte ptr [edi],0
and lstrcpy etc. will consider it empty.
Quote from: E^cube
heh you must not be aware of Dave's humor, shame.
Oh, I'm fully aware of it.
Ahh I remember you, you were insulting me in PM :tdown I see you're playing nice with others too... :snooty:
i meant that, Greg - i wish i knew all the stuff you guys know
i might learn some of it, if i had more time, too
i had planned on spending the entire summer learning more about win32 code
as it turned out, i didn't get to spend hardly any time on code
this weekend, it looks like i am off to Michigan to remodel a house, too - lol
by the time the day is done, i will be too tired to concentrate on learning anything new
maybe by the time christmas rolls around......
Quote from: E^cube on August 26, 2010, 01:55:32 AM
To the OP, if you're just trying to "clear" a buffer to use with an ascii string you can
lea edi,buffer
mov byte ptr [edi],0
and lstrcpy etc. will consider it empty.
I'm trying to fill a buffer with spaces [ASCII 32] and If somebody can translate
and compile in 64 bit Assembly the equivalent of:
push edi
mov ecx, 2000
mov edi, offset Dest
mov eax, 20202020h
rep stosd
pop edi
I'll have a more detailed idea of how 64 bit native registers compare to
SSE instructions. My machine and OS are 64 bit, I'm not yet. :'(
E^cube,
No, you have that backwards E^cube, it was you sending me nasty PMs. And you got a chance to take a jab at me and you sure took advantage of it didn't you?
Dave,
No problem Dave, I guess I'm just a little on edge tonight, I'm sorry.
Quote from: dedndave on August 26, 2010, 01:21:33 AM
give yourself some credit Frank - lol
STOSQ stores a qword (64 bit value)
REP STOSQ is probably fast as hell on 64-bit machines
Jochen showed REP STOSD in his 64-bit example code - that was probably just a small oversight
i am sure he meant REP STOSQ
RtlZeroMemory probably preserves EDI (or RDI), but other than that, it is straightforward
you can assume that the direction flag is clear, as it should be - i am sure RtlZeroMemory also makes that assumption
so....
load the value you want repeated into EAX/RAX
load the repeat count into ECX/RCX
load the address into EDI/RDI
then do the REP STOSD or REP STOSQ
it will be quite fast as long as the address is 4-byte-aligned for 32-bit code or 8-byte-aligned for 64-bit code
Well, Dave I got the 32 bit version of
rep stosd, it is not that difficult.
I'm trying to see in 64 bit how it works, and I don't have a clue on how to do it
because I have no 64 bit experience and tools at all. ::)
So I'm asking somebody to translate and compile in 64 bit mode to test the
performance it gets.
frktons,
I don't think anyone has written timing routines in x64 yet.
Quote from: Greg Lyon on August 26, 2010, 02:25:52 AM
frktons,
I don't think anyone has written timing routines in x64 yet.
Time to start? Well, I can be satisfied just seeing how it translate into 64 bit.
It shouldn't be too complex:
push edi
mov ecx, 2000
mov edi, offset Dest
mov eax, 20202020h
rep stosd
pop edi
::)
Jochen suggested to start from:
mov rax, 20202020202020202020202020202020h
mov rdi, offset Dest
mov rcx, 1000
rep stosq
And it is quite clear, but what about all the rest of the program?
And moreover is this a legal 64 bit syntax that a 64 bit assembler can assemble?
Which one? JWasm, GoAsm, ML64?
push rdi
mov rcx,1000
mov rdi,offset Dest
mov rax,2020202020202020h
rep stosq
pop rdi
i have no way to test it :P
JwAsm will assemble it for you
you could clear a larger area using 32-bit and 64-bit, then use a stopwatch :lol
Quote from: dedndave on August 26, 2010, 02:35:11 AM
push rdi
mov rcx,1000
mov rdi,offset Dest
mov rax,2020202020202020h
rep stosq
pop rdi
i have no way to test it :P
JwAsm will assemble it for you
oh, I was writing while you replied. Have a look at my prev post Master.
i think Jochen has a few too many spaces in there, Frank :bg
16 spaces = 128 bits in a 64-bit reg
overflow !!! - lol
Quote from: dedndave on August 26, 2010, 02:38:13 AM
i think Jochen has a few too many spaces in there, Frank :bg
16 spaces = 128 bits in a 64-bit reg
overflow !!! - lol
You are right, those were 128 bit xmm registers He used with SSE mnemonics and probably
forgot we were going to native 64 bit registers rxx.
Me too :P
2020202020202020h
Folks, if I remember well, Hutch knows a revolutionary mathematical technology to count the spaces in this expression :bg
:bg
Huh ?
In my own case "revolutionary" and "mathematical" are not compatible in the same sentence. I freely admit to "Eenie meanie minie moe" technology (fingers) and have to cheat and use computers to add up numbers.
Quote from: hutch-- on August 26, 2010, 07:06:51 AM
:bg
Huh ?
In my own case "revolutionary" and "mathematical" are not compatible in the same sentence. I freely admit to "Eenie meanie minie moe" technology (fingers) and have to cheat and use computers to add up numbers.
I meant the "Eenie meanie minie moe" technology. It is quite sufficient to see that there are 8 spaces in 2020202020202020h, not 16 as Dave suspected :wink
:bg
Dave is probably cheating and using both hands.
Quote from: jj2007 on August 25, 2010, 08:30:53 PM
What about you? I'll give you a starting point:
mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd
i'd say that either Jochen has had one too many cappuccinos or his space bar is stuck :lol
note: it should also be STOSQ - not STOSD
http://www.masm32.com/board/index.php?topic=14685.msg119244#msg119244
i cheated and used both hands to create this post
Quote from: dedndave on August 26, 2010, 12:13:25 PM
Quote from: jj2007 on August 25, 2010, 08:30:53 PM
What about you? I'll give you a starting point:
mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd
i'd say that either Jochen has had one too many cappuccinos or his space bar is stuck :lol
note: it should also be STOSQ - not STOSD
Oops, you are right! I had looked at replies 51 & 52 and saw exactly 8 spaces in the code... but that was your code, not mine :red
Apologies :thumbu
no prob JJ :P
it feels good to catch you, once in a while
After some experimentation I got these results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
2059 cycles for RtlZeroMemory
4023 cycles for FrkTons
2070 cycles for rep stosd
1062 cycles for movdqa
1062 cycles for movaps
1024 cycles for FrkTons New
5023 cycles for movups
5050 cycles for movupd
2087 cycles for RtlZeroMemory
4043 cycles for FrkTons
2064 cycles for rep stosd
1038 cycles for movdqa
1050 cycles for movaps
1016 cycles for FrkTons New
5036 cycles for movups
5042 cycles for movupd
--- ok ---
How can it be possible?
The new test attached.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2293 cycles for RtlZeroMemory
4029 cycles for FrkTons
2272 cycles for rep stosd
2017 cycles for movdqa
2018 cycles for movaps
2140 cycles for FrkTons New
6026 cycles for movups
6021 cycles for movupd
Can't see any surprises in here ::)
Quote from: jj2007 on September 03, 2010, 08:44:08 PM
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2293 cycles for RtlZeroMemory
4029 cycles for FrkTons
2272 cycles for rep stosd
2017 cycles for movdqa
2018 cycles for movaps
2140 cycles for FrkTons New
6026 cycles for movups
6021 cycles for movupd
Can't see any surprises in here ::)
I should have imagined that I did something wrong :P
Maybe if you use an older CPU the program wouldn't even run :lol
Hi, Frank!
These are the results on my CPU:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
5144 cycles for RtlZeroMemory
8253 cycles for FrkTons
4952 cycles for rep stosd
4790 cycles for movdqa
4801 cycles for movaps
4862 cycles for FrkTons New
10594 cycles for movups
10601 cycles for movupd
4960 cycles for RtlZeroMemory
8383 cycles for FrkTons
4952 cycles for rep stosd
4820 cycles for movdqa
4795 cycles for movaps
4861 cycles for FrkTons New
10598 cycles for movups
10602 cycles for movupd
Alex
Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run :lol
You would need a very old CPU.
On mine, you can save 40 cycles with a small modification:
mov edx, offset Dest
lea ecx, [edx+16000]
mov eax, 20202020h
movd xmm0, eax
pshufd xmm0, xmm0, 0
; movdqa xmm1, xmm0
; movdqa xmm2, xmm0
; movdqa xmm3, xmm0
; movdqa xmm4, xmm0
@@:
movdqa [edx], xmm0
movdqa [edx + 16], xmm0
movdqa [edx + 32], xmm0
movdqa [edx + 48], xmm0
movdqa [edx + 64], xmm0
add edx, 80
cmp edx, ecx
jl @B
And you can save some bytes if you use MOVAPS for the moves to registers and to memory :)
Alex
Quote from: jj2007 on September 03, 2010, 10:50:43 PM
Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run :lol
You would need a very old CPU.
On mine, you can save 40 cycles with a small modification:
mov edx, offset Dest
lea ecx, [edx+16000]
mov eax, 20202020h
movd xmm0, eax
pshufd xmm0, xmm0, 0
; movdqa xmm1, xmm0
; movdqa xmm2, xmm0
; movdqa xmm3, xmm0
; movdqa xmm4, xmm0
@@:
movdqa [edx], xmm0
movdqa [edx + 16], xmm0
movdqa [edx + 32], xmm0
movdqa [edx + 48], xmm0
movdqa [edx + 64], xmm0
add edx, 80
cmp edx, ecx
jl @B
I already tested this kind of unrolling, but the best performance on Core 2 duo
happens with 5 different XMM registers. The CPU architecture plays the big
role for the 20-50 cycles difference. Not that much anyway. In my opinion it's
just the cache memory that gives some extra speed on Core 2. I'd like to see
what these routines gain or lose on the more recent quad/i3-i7 machines as well.
If anyone has got this newest kind of CPU.
Frank
Quote from: Antariy on September 03, 2010, 10:54:27 PM
And you can save some bytes if you use MOVAPS for the moves to registers and to memory :)
Alex
Yes Alex. I'm testing just the speed and MOVDQA looks a little bit faster than MOVAPS.
It's a very tiny difference indeed. At least on my machine.
Frank
Frank, on i3-i7 "rep stosd" should run as fast as, or faster than, the SSE code. I cannot check this.
What if you write to memory with "MOVNTDQ", a non-temporal write that bypasses the cache?
Edited:
timings with direct writing:
13705 cycles for FrkTons New with MOVNTDQ
13819 cycles for FrkTons New with MOVNTPD
13787 cycles for FrkTons New with MOVNTPS
Other timings omitted, because I have posted them already.
Alex
Quote from: Antariy on September 03, 2010, 11:07:59 PM
Frank, on i3-i7 "rep stosd" should run as fast as, or faster than, the SSE code. I cannot check this.
What if you write to memory with "MOVNTDQ", a non-temporal write that bypasses the cache?
Alex
Alex MOVNTDQ is quite slow on my machine:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
2059 cycles for RtlZeroMemory
4029 cycles for FrkTons
2047 cycles for rep stosd
1033 cycles for movdqa
1032 cycles for movaps
1017 cycles for FrkTons New
5024 cycles for movups
5020 cycles for movupd
8047 cycles for MOVNTDQ
2090 cycles for RtlZeroMemory
4046 cycles for FrkTons
2062 cycles for rep stosd
1052 cycles for movdqa
1047 cycles for movaps
1017 cycles for FrkTons New
5036 cycles for movups
5042 cycles for movupd
7864 cycles for MOVNTDQ
--- ok ---
I suppose that if you could do rep stosq with a 64 bit rxx register the results would be similar or better
than SSE2 instructions. But nobody has taken the task to compile with a 64 bit
assembler. I'll probably do it when I'll be more familiar with JWASM.
For the time being I don't find the time to study also that :P
Frank
Yes, I posted my results for non-temporal writes as well (in the post that asked about them).
MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex
Quote from: Antariy on September 03, 2010, 11:20:37 PM
Yes, I posted my results for non-temporal writes as well (in the post that asked about them).
MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex
I'm trying with 16MB, but the program is taking a lot of time to compile ::)
Will it ever end?
Quote from: frktons on September 03, 2010, 11:34:53 PM
I'm trying with 16MB, but the program is taking a lot of time to compile ::)
Will it ever end?
This is a known problem with in-exe allocation of data, maybe a "bug" in MASM. Try JWasm.
Or allocate the buffer with the heap functions (like GlobalAlloc etc). Allocate a 32 or 16MB buffer on the heap, and use it in the tests.
Alex
Quote from: Antariy on September 03, 2010, 11:46:07 PM
This is a known problem with in-exe allocation of data, maybe a "bug" in MASM. Try JWasm.
Or allocate the buffer with the heap functions (like GlobalAlloc etc). Allocate a 32 or 16MB buffer on the heap, and use it in the tests.
Alex
All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.
Frank
Quote from: frktons on September 03, 2010, 11:52:45 PM
All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.
Frank
For a 16MB buffer:
invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024
The pointer to the buffer is returned in eax. Save it in a variable, and in the test code load the pointer from the variable into a register and use that.
Alex
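A small sketch of how this could be wrapped up, for a MASM32 program that includes masm32rt.inc. The names RawPtr, DataPtr, BUF_SIZE, AllocTestBuffer and FreeTestBuffer are only illustrative. As far as I know GlobalAlloc only promises 8-byte alignment in a 32-bit process, so the sketch over-allocates by 15 bytes and rounds the pointer up to a 16-byte boundary, which the movdqa/movaps tests need.

```masm
; Sketch only; all names here are hypothetical.
BUF_SIZE equ 16*1024*1024

.data?
RawPtr  dd ?        ; value returned by GlobalAlloc, needed for GlobalFree
DataPtr dd ?        ; same block, rounded up to a 16-byte boundary

.code
AllocTestBuffer PROC
    invoke  GlobalAlloc, GMEM_FIXED or GMEM_ZEROINIT, BUF_SIZE + 15
    mov     RawPtr, eax
    test    eax, eax
    jz      @F                  ; allocation failed: return 0 in eax
    add     eax, 15
    and     eax, -16            ; round up to a multiple of 16
    mov     DataPtr, eax
@@:
    ret                         ; eax = aligned pointer, or 0 on failure
AllocTestBuffer ENDP

FreeTestBuffer PROC
    invoke  GlobalFree, RawPtr  ; free with the original pointer, not the aligned one
    ret
FreeTestBuffer ENDP
```

In the timed code, load the pointer once (for example mov edi, DataPtr) and work from the register.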
Quote from: Antariy on September 03, 2010, 11:57:37 PM
For a 16MB buffer:
invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024
The pointer to the buffer is returned in eax. Save it in a variable, and in the test code load the pointer from the variable into a register and use that.
Alex
Alex, is this enough or do I have to change something else?
.data?
align 16
; Dest db 16000000 dup(?) ; <------ don't use it anymore
DataPtr dd ? ; <-------------- Pointer for data allocated
.code
start:
push 1
call ShowCpu ; print brand string and SSE level
invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024
mov DataPtr, eax
REPEAT 2
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke RtlZeroMemory, DataPtr, 16000000 ; <----------------- is this use of DataPtr correct?
counter_end
Alex you were right, with big buffer I have these results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
11628116 cycles for RtlZeroMemory
15883968 cycles for FrkTons
10996290 cycles for rep stosd
15473203 cycles for movdqa
15480211 cycles for movaps
15471071 cycles for FrkTons New
15477872 cycles for movups
15445525 cycles for movupd
8082000 cycles for MOVNTDQ
10999930 cycles for RtlZeroMemory
15870714 cycles for FrkTons
11012185 cycles for rep stosd
15418317 cycles for movdqa
15427633 cycles for movaps
15416041 cycles for FrkTons New
15418995 cycles for movups
15415995 cycles for movupd
8162844 cycles for MOVNTDQ
--- ok ---
and rep/stosd is faster than the SSE2 instructions, apart from MOVNTDQ.
I reduced the number of loops in the test to 1,000
instead of 1 million, to make it shorter.
Frank
Absent a newer build here's the result from the last one
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
6354 cycles for RtlZeroMemory
10341 cycles for FrkTons
6335 cycles for rep stosd
4143 cycles for movdqa
4108 cycles for movaps
1667 cycles for FrkTons New
8220 cycles for movups
8340 cycles for movupd
6296 cycles for RtlZeroMemory
10333 cycles for FrkTons
6265 cycles for rep stosd
4117 cycles for movdqa
4153 cycles for movaps
1675 cycles for FrkTons New
8227 cycles for movups
8232 cycles for movupd
Core Solo
Genuine Intel(R) CPU T1350 @ 1.86GHz (SSE3)
2319 cycles for RtlZeroMemory
4072 cycles for FrkTons
2305 cycles for rep stosd
2039 cycles for movdqa
2039 cycles for movaps
2155 cycles for FrkTons New
6098 cycles for movups
6088 cycles for movupd
2315 cycles for RtlZeroMemory
4082 cycles for FrkTons
2296 cycles for rep stosd
2038 cycles for movdqa
2038 cycles for movaps
2164 cycles for FrkTons New
6087 cycles for movups
6095 cycles for movupd
Quote from: clive on September 04, 2010, 12:38:21 AM
Wooops, the Atom really likes working with many XMM registers at a time.
You are right clive, I didn't post the new build, so here it is. :U
And for readability, here is a version that formats
the elapsed CPU cycles with a thousands separator.
This version fills a 16MB buffer.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
11.967.493 cycles for RtlZeroMemory
15.682.875 cycles for FrkTons
10.971.464 cycles for rep stosd
15.418.911 cycles for movdqa
15.435.221 cycles for movaps
15.409.998 cycles for FrkTons New
15.405.469 cycles for movups
15.518.687 cycles for movupd
8.056.812 cycles for MOVNTDQ
11.051.772 cycles for RtlZeroMemory
15.535.943 cycles for FrkTons
10.997.179 cycles for rep stosd
15.467.940 cycles for movdqa
15.457.092 cycles for movaps
15.485.439 cycles for FrkTons New
15.514.719 cycles for movups
15.513.319 cycles for movupd
8.053.411 cycles for MOVNTDQ
--- ok ---
The "improved version" is attached. :P
In my humble n00b-ist opinion, when we use REP/STOSQ on a native 64-bit
OS with x64 machines,
it is going to beat everything else. Not sure about MOVNTDQ, though.
Frank
AMD Phenom(tm) II X6 1055T Processor (SSE3)
10.598.549 cycles for RtlZeroMemory
11.260.417 cycles for FrkTons
10.182.613 cycles for rep stosd
10.131.907 cycles for movdqa
10.115.035 cycles for movaps
10.214.832 cycles for FrkTons New
9.915.582 cycles for movups
10.188.273 cycles for movupd
6.888.405 cycles for MOVNTDQ
10.199.509 cycles for RtlZeroMemory
11.050.508 cycles for FrkTons
10.192.022 cycles for rep stosd
10.131.227 cycles for movdqa
10.113.104 cycles for movaps
10.217.227 cycles for FrkTons New
9.952.748 cycles for movups
10.184.520 cycles for movupd
6.700.808 cycles for MOVNTDQ
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
15.891.933 cycles for RtlZeroMemory
23.350.696 cycles for FrkTons
15.800.872 cycles for rep stosd
23.825.786 cycles for movdqa
23.886.914 cycles for movaps
23.872.424 cycles for FrkTons New
23.760.942 cycles for movups
23.730.159 cycles for movupd
9.818.186 cycles for MOVNTDQ
Quote from: jj2007 on September 04, 2010, 04:51:36 PM
Well, for big buffers, as Alex said, MOVNTDQ is faster than anything else
on every machine tested so far.
Clive's Atom was really impressive in its use of multiple
XMM registers. I hope he'll post his results for this test as well.
Ok, this is from the original Acer Aspire One, I should try it on the newer one with the N450 CPU.
From ClearBufferNew3
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
23855499 cycles for RtlZeroMemory
23614698 cycles for FrkTons
23448786 cycles for rep stosd
23485010 cycles for movdqa
23496468 cycles for movaps
23578524 cycles for FrkTons New
23528033 cycles for movups
23437548 cycles for movupd
9166934 cycles for MOVNTDQ
23551485 cycles for RtlZeroMemory
23568079 cycles for FrkTons
23531873 cycles for rep stosd
23494767 cycles for movdqa
23480850 cycles for movaps
23560303 cycles for FrkTons New
23521236 cycles for movups
23485775 cycles for movupd
9156989 cycles for MOVNTDQ
From ClearBufferNew4
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
23.621.570 cycles for RtlZeroMemory
23.548.342 cycles for FrkTons
23.633.944 cycles for rep stosd
23.482.208 cycles for movdqa
23.584.646 cycles for movaps
23.504.980 cycles for FrkTons New
23.561.681 cycles for movups
23.543.870 cycles for movupd
9.184.061 cycles for MOVNTDQ
23.566.672 cycles for RtlZeroMemory
23.507.384 cycles for FrkTons
23.601.428 cycles for rep stosd
23.489.780 cycles for movdqa
23.512.724 cycles for movaps
23.516.591 cycles for FrkTons New
23.549.450 cycles for movups
23.596.780 cycles for movupd
9.156.032 cycles for MOVNTDQ
The Atom is a surprise again :dazzled:
Its results always stand out, for better or for worse.
Thanks clive
Frank
This is an attempt to build an x64 app on a machine that runs x64 code very badly :P
Please post results. Frank has been waiting for this for a very long time.
Frank, this is built with GoAsm.
Alex
Hi Alex, thanks for doing this test.
The results on my machine are:
MOVNTPD
Clocks: 4.090.321.879
REP STOSQ
Clocks: 1.977.888.757
What buffer size did you use and how many loops did you perform?
Maybe 1 million loops and 16MB?
Well, these results seem to confirm my suspicion that on a 64-bit machine
REP/STOSQ is the fastest buffer-filling instruction for the time being :U
May I know how you compiled it?
I can download GoASM and try some experiments.
Frank
Quote from: frktons on September 05, 2010, 09:47:46 PM
Hi Alex, thanks for doing this test.
The results on my machine are:
MOVNTPD
Clocks: 4.090.321.879
What buffer size did you use and how many loops did you perform?
Maybe 1 million loops and 16MB?
Frank
Frank, you forgot to paste the timings for REP STOSQ; they are in the second message box.
I used a 32MB buffer and 100 loops of the test.
Alex
P.S. What are the timings for STOSQ?
I posted in the previous post Alex. Have a look.
REP/STOSQ looks a lot faster than MOVNTPD
at least on my machine.
Thanks again for doing the test. :clap:
Frank
Quote from: frktons on September 05, 2010, 09:57:05 PM
I posted in the previous post Alex. Have a look.
REP/STOSQ looks a lot faster than MOVNTPD
at least on my machine.
Thanks again for doing the test. :clap:
Frank
Yes, this behaviour is not surprising (that STOSQ is faster).
Initially you didn't post the results for REP STOSQ :P
You added them later :)
Alex
Quote from: Antariy on September 05, 2010, 10:04:49 PM
Yes, this behaviour is not surprising (that STOSQ is faster).
Initially you didn't post the results for REP STOSQ :P
You added them later :)
Alex
Yes Alex, a MessageBox appeared and I didn't know that it would be followed by
a second one, so I posted only the first result. :P
On my CPU REP/STOSQ is 4:1 faster than MOVNTPD, and in some tests even more.
It was just an idea of mine that native x64 code and MOVs through RXX registers are faster than
SSE2 for simple moves of data. And these tests seem to confirm that idea, thanks to you.
Frank
Frank, I build it like this:
goasm /x64 asmfilename.asm
golink asmfilename.obj
Nothing more.
Run the tools with /? to see the full help on their parameters.
EDITED: Frank, I forgot to add this:
To link, you need to add the names of the DLLs whose APIs are used to the command line, so:
golink asmfilename.obj kernel32.dll user32.dll ... etc
Alex
Quote from: Antariy on September 05, 2010, 10:22:46 PM
Frank, I build it like this:
goasm /x64 asmfilename.asm
golink asmfilename.obj
Nothing more.
Run the tools with /? to see the full help on their parameters.
Alex
Very good, thanks.
As I have some spare time I'll do some experiments with 64-bit code,
I think I'll enjoy it.
Frank
Frank, test the app attached to this post. I commit the memory before the test; this should give better results in the tests.
Don't forget to post the timings :)
Alex
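A rough 32-bit MASM32 sketch of the idea, assuming that "commit before the test" means allocating committed pages and touching each one, so that no page faults land inside the timed loop. AllocTouched, CLR_TEST_SIZE and TOUCH_STEP are made-up names, and this is not the code from the attached x64 test.

```masm
; Sketch only; names are hypothetical. Assumes 4096-byte pages and a
; program that includes masm32rt.inc.
CLR_TEST_SIZE equ 32*1024*1024
TOUCH_STEP    equ 4096

.code
AllocTouched PROC
    invoke  VirtualAlloc, NULL, CLR_TEST_SIZE, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
    test    eax, eax
    jz      done                    ; allocation failed: return 0
    mov     edx, eax                ; VirtualAlloc memory starts on a page boundary
    mov     ecx, CLR_TEST_SIZE / TOUCH_STEP
@@:
    mov     byte ptr [edx], 0       ; first write makes the OS back the page now
    add     edx, TOUCH_STEP
    dec     ecx
    jnz     @B
done:
    ret                             ; eax = base address, or 0 on failure
AllocTouched ENDP
```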
The test produces these results on my CPU:
Clearing done
183292241 clocks for a 33554432 bytes buffer with using REP STOSQ
Clearing done
921076349 clocks for a 33554432 bytes buffer with using MOVNTDQ
REP/STOSQ is getting faster this way.
With these big numbers a thousand separator would help a lot:
Clearing done
183.292.241 clocks for a 33.554.432 bytes buffer with using REP STOSQ
Clearing done
921.076.349 clocks for a 33.554.432 bytes buffer with using MOVNTDQ
Frank
Oh...
At the last moment I added cpuid to the test, but didn't do everything needed for it... That's what hurrying does...
Frank, please test the new one attached to this post. The previous test is NOT right.
Alex
Clearing done
23712588 clocks for a 33554432 bytes buffer with using REP STOSQ
Clearing done
20154312 clocks for a 33554432 bytes buffer with using MOVNTDQ
That's it Alex, MOVNTDQ is still faster than REP STOSQ
Frank
I would use a macro to set spaces in a buffer. A macro can be tailored for peak performance, and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to say whether the edi register needs to be preserved or not, which would also save instructions.
A macro can let you choose between different methods based on how fast each method is for different data sizes. One method for sizes x-y, another method for sizes v-t etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.
Take advantage of macros :U
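A minimal sketch of what such a macro might look like in MASM. The name FillSpaces, its parameters and the 32-byte threshold are all invented for illustration; the right thresholds and methods would have to come from timings like the ones in this thread.

```masm
; Sketch only: FillSpaces is a hypothetical macro.
;   pBuf    = register, OFFSET expression or memory operand holding the
;             buffer address (must not be esp-relative when saveEdi is 1)
;   cb      = constant byte count, a multiple of 4
;   saveEdi = 1 to preserve edi (default), 0 if the caller does not need it
; The method is chosen at assembly time, so no run-time test is emitted.
FillSpaces MACRO pBuf:REQ, cb:REQ, saveEdi:=<1>
    IF (cb) LE 32
        ;; small buffer: unrolled dword stores, no loop, edi untouched
        mov     edx, pBuf
        FillSpaces_offs = 0
        WHILE FillSpaces_offs LT (cb)
            mov     dword ptr [edx+FillSpaces_offs], 20202020h
            FillSpaces_offs = FillSpaces_offs + 4
        ENDM
    ELSE
        ;; larger buffer: rep stosd
        IF saveEdi
            push    edi
        ENDIF
        mov     edi, pBuf
        mov     eax, 20202020h
        mov     ecx, (cb)/4
        rep     stosd
        IF saveEdi
            pop     edi
        ENDIF
    ENDIF
ENDM
```

Usage could look like FillSpaces OFFSET buf2clear, 8000 for the 8,000-byte buffer from the first post, or FillSpaces esi, 16, 0 for a tiny buffer. Note that the macro clobbers eax, ecx and edx.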
Quote from: zemtex on September 26, 2010, 06:39:05 PM
I would use a macro to set spaces in a buffer. A macro can be tailored for peak performance, and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to say whether the edi register needs to be preserved or not, which would also save instructions.
A macro can let you choose between different methods based on how fast each method is for different data sizes. One method for sizes x-y, another method for sizes v-t etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.
Take advantage of macros :U
Feel free to post any working example you like. :U
Frank
On my laptop (MacBook Pro, running 64-bit Windows 7) I get these results
For ClearBufferNew4.exe:
Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz (SSE4)
16.270.158 cycles for RtlZeroMemory
22.256.374 cycles for FrkTons
17.003.011 cycles for rep stosd
22.407.395 cycles for movdqa
21.586.957 cycles for movaps
20.894.627 cycles for FrkTons New
21.574.685 cycles for movups
21.463.688 cycles for movupd
8.449.814 cycles for MOVNTDQ
for clearbufx64_3.exe:
Clearing done
65323406 clocks for a 33554432 bytes buffer with using REP STOSQ
Clearing done
33619797 clocks for a 33554432 bytes buffer with using MOVNTDQ
I guess what is faster will depend on the machine.
Quote from: xanatose on October 01, 2010, 01:17:55 AM
On my laptop (MacBook Pro, running 64-bit Windows 7) I get these results
.........
I guess what is faster will depend on the machine.
Hi!
Thanks for testing!
Just note that ClearBufferNew4.exe is a 32-bit app and uses the 32-bit REP STOSD, whose writes are probably cached to make the transactions with the system bus more efficient. clearbufx64_3.exe is a 64-bit app and uses the 64-bit REP STOSQ, whose writes are probably not cached while the fill is in progress, because its timings are very close to those of the SSE2 algo that writes 128 bits at a time uncached.
Alex