Title: The fastest way to clear a buffer
Post by: frktons on August 24, 2010, 08:47:34 PM

Hi all.
How can I clear a buffer in the fastest possible way?
I have an area of 8,000 bytes and I want to fill it with spaces after having
used it for other purposes.
I came up with this solution and I'm wondering if there are better and
faster ways to do it.

Code:



;----------------------------------------------------------------------
; Fast way for clearing [putting all spaces into] a
; structure CHAR_INFO totalling 8000 bytes.
;----------------------------------------------------------------------
; Author: frktons @ MASM32 forum
; Date: 24/aug/2010.
;----------------------------------------------------------------------


include \masm32\include\masm32rt.inc


ClearBuffer PROTO :DWORD


;----------------------------------------------------------------------


.data?

    buf2clear CHAR_INFO 2000 dup (<>)
    rHnd      HANDLE ?

    howmany   dd ?
    buffer    INPUT_RECORD <>    
   

.code

start:

Main PROC

    INVOKE GetStdHandle, STD_INPUT_HANDLE
    mov rHnd,eax

    INVOKE ClearBuffer, ADDR buf2clear
    
    print "Clearing done",13,10,13,10
    print "Press any key to close...",13,10
    
    CALL AnyKey

finish: INVOKE ExitProcess,0

    ret

Main ENDP

; -------------------------------------------------------------------------   

ClearBuffer PROC AddrBuffer:DWORD

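    mov eax, AddrBuffer     ; eax -> start of the buffer
    mov ecx, 1000           ; 1000 passes * 8 bytes = 8000 bytes
    mov bl, 32              ; build 20202020h in ebx byte by byte
    mov bh, bl              ; bx = 2020h
    bswap   ebx             ; move the two spaces to the high word
    mov bl, 32
    mov bh, bl              ; ebx = 20202020h (ebx is not preserved here)

cycle:

    mov [eax], ebx          ; store 4 spaces
    add eax, 4
    mov [eax], ebx          ; store 4 more
    add eax, 4
    dec ecx
    jnz cycle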
        

    ret

ClearBuffer ENDP

; -------------------------------------------------------------------------
;Returns: key code in buffer.KeyEvent.wVirtualKeyCode WORD size
; -------------------------------------------------------------------------

AnyKey PROC

again: 

    INVOKE ReadConsoleInput,rHnd,offset buffer,1,offset howmany
    cmp buffer.EventType,KEY_EVENT
    jnz again

    cmp buffer.KeyEvent.bKeyDown,0
    jz again

    ret

AnyKey ENDP

; -------------------------------------------------------------------------

end start

Any improvement possible?

Thanks

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 24, 2010, 08:55:47 PM

just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that

Title: Re: The fastest way to clear a buffer
Post by: Magnum on August 24, 2010, 09:16:02 PM

Code:


.DATA
 
    ValueOK     db  "Memory zeroed out.",0  
    Sample      db  "BOX",0
    Storage     db  "Co-ordinates of the Ark of the Covenant are...",0  

.data?                                        
     
    Storage1     db  256 dup(?)   
    
.CODE

start:

invoke  RtlZeroMemory, ADDR Storage, sizeof Storage ; in kernel32.inc

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 24, 2010, 09:42:37 PM

Well, he wants spaces, not zeroes, but a rep stosd is most probably the fastest way to fill an 8k buffer with spaces.

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
1231    cycles for rep stosd.

Title: Re: The fastest way to clear a buffer
Post by: Antariy on August 24, 2010, 09:56:01 PM

Hi, Frank!

If you change the code to this:

Code:


ClearBuffer PROC AddrBuffer:DWORD

    mov eax, AddrBuffer
    mov ecx, 1000
    mov ebx,20202020h ; load ebx with a single instruction
cycle:

    mov [eax], ebx
    mov [eax+4], ebx
    add eax, 8
    dec ecx
    jnz cycle
        

    ret

ClearBuffer ENDP

Does this work?

Or this:

Code:


ClearBuffer PROC AddrBuffer:DWORD
    mov edx,edi              ; save edi in edx
    mov ecx, 2000            ; must be 2000 dwords (8000 bytes). Thanks to Jochen!
    mov edi, AddrBuffer
    mov eax,20202020h
    rep stosd
    mov edi,edx
    ret

ClearBuffer ENDP

Test this.

Alex

EDITED. Jochen tried to confuse me with a bug. Jochen, this is not my code, and it is hard to concentrate when replying online in real time. I am only making a suggestion; Frank can find the solution himself. I am only showing the way (there is no need to fill ebx piece by piece; shorter code will do).
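
The two variants above use different counts for the same 8000 bytes: the unrolled loop stores two dwords per pass (1000 passes), while rep stosd stores one dword per repetition (2000 repetitions). A minimal C sketch of that arithmetic follows. It is only a model of what REP STOSD does, not the code timed in this thread, and `fill_dwords` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of REP STOSD: write `pattern` once per dword.
   Returns the repeat count, i.e. the value ECX would need. */
size_t fill_dwords(void *dst, size_t bytes, uint32_t pattern)
{
    assert(bytes % 4 == 0);          /* this sketch handles whole dwords only */
    size_t count = bytes / 4;        /* 8000 bytes -> 2000 dwords, not 1000 */
    unsigned char *p = dst;
    for (size_t i = 0; i < count; ++i) {
        memcpy(p + 4 * i, &pattern, sizeof pattern);   /* one 4-byte store */
    }
    return count;
}
```

Called as `fill_dwords(buf, 8000, 0x20202020u)`, it fills the buffer with spaces and reports 2000 repetitions.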

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 24, 2010, 09:58:45 PM

Or, if you are not scared of SSE2:

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

Title: Re: The fastest way to clear a buffer
Post by: Antariy on August 24, 2010, 10:03:35 PM

Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

Jochen, don't confuse Frank with your experience :) Everybody knows that you like SSE2 very much. What about movaps?

Alex

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 24, 2010, 10:54:21 PM

Quote from: Antariy on August 24, 2010, 10:03:35 PM
What about movaps?

Identical. One byte shorter, of course.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 24, 2010, 11:00:51 PM

Quote from: dedndave on August 24, 2010, 08:55:47 PM
just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that

Hi Dave.
The name RtlZeroMemory suggests this function clears to zero an area of memory.
It can be useful for other situations, here I need to clear to spaces [ASCII 32].

Quote from: Antariy on August 24, 2010, 09:56:01 PM
Hi, Frank!

If you change the code to this:
Code:
ClearBuffer PROC AddrBuffer:DWORD

    mov eax, AddrBuffer
    mov ecx, 1000
    mov ebx,20202020h ; load ebx with a single instruction
cycle:

    mov [eax], ebx
    mov [eax+4], ebx
    add eax, 8
    dec ecx
    jnz cycle

    ret

ClearBuffer ENDP

Does this work?

Or this:

Code:
ClearBuffer PROC AddrBuffer:DWORD
    mov edx,edi              ; save edi in edx
    mov ecx, 2000            ; must be 2000 dwords (8000 bytes). Thanks to Jochen!
    mov edi, AddrBuffer
    mov eax,20202020h
    rep stosd
    mov edi,edx
    ret

ClearBuffer ENDP

Test this.

Alex

EDITED. Jochen tried to confuse me with a bug. Jochen, this is not my code, and it is hard to concentrate when replying online in real time. I am only making a suggestion; Frank can find the solution himself. I am only showing the way (there is no need to fill ebx piece by piece; shorter code will do).

Thanks Alex. The first solution should gain some cycles compared to mine,
and the second one, using stosd, should be faster according to your comments.
I have to test it and understand how stosd works; it is the first time
I've seen it :P

Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

Hi Jochen, if you post the code I can have a look at it.
I'm not scared of SSE2/3/4 but I don't know them so it could
be an occasion to get INTEL manuals working a little. :lol

And last but not least, how does my version perform compared to:

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

How much faster are these methods compared to the first one I posted?

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 24, 2010, 11:12:56 PM

Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
2024    cycles for FrkTons
1233    cycles for rep stosd
1014    cycles for movdqa
1013    cycles for movaps

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 24, 2010, 11:18:11 PM

Quote from: jj2007 on August 24, 2010, 11:12:56 PM
Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?

Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
2024    cycles for FrkTons
1233    cycles for rep stosd
1014    cycles for movdqa
1013    cycles for movaps

Thanks Jochen,
now I've an idea of the performance gap among the various methods.
Time to study them a little, tomorrow and the days to come. :U

On my PC I get these results:

Code:


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
1058    cycles for RtlZeroMemory
2022    cycles for FrkTons
1056    cycles for rep stosd
532     cycles for movdqa
531     cycles for movaps

1056    cycles for RtlZeroMemory
2318    cycles for FrkTons
1224    cycles for rep stosd
616     cycles for movdqa
613     cycles for movaps


--- ok ---

Interestingly, RtlZeroMemory, probably a C/C++ function, is about
twice as fast as the elementary hand-written assembly version I coded. :P

Title: Re: The fastest way to clear a buffer
Post by: ecube on August 25, 2010, 12:08:08 AM

I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago (http://www.masm32.com/board/index.php?topic=6576.0). The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 25, 2010, 06:30:46 AM

Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago (http://www.masm32.com/board/index.php?topic=6576.0). The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

Last post of that thread:

Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster

1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).
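
The alignment requirement in point 2 can be illustrated with a small C sketch using SSE2 intrinsics. This is not the code that produced the timings above (that code is not shown in this thread); `fill_aligned16` is a hypothetical name, and the memset branch exists only so the sketch still builds where SSE2 intrinsics are not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Fill `bytes` bytes at dst with `value`, 16 bytes per store.
   dst must be 16-byte aligned and bytes a multiple of 16,
   mirroring the requirement of aligned stores such as movdqa. */
void fill_aligned16(void *dst, size_t bytes, unsigned char value)
{
    assert(((uintptr_t)dst % 16) == 0);   /* an aligned store faults otherwise */
    assert(bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);   /* 16 copies of the byte */
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(p + i, v);            /* aligned 16-byte store */
#else
    memset(dst, value, bytes);                /* fallback without SSE2 */
#endif
}
```

An 8000-byte buffer is 500 stores of 16 bytes each, provided the buffer itself starts on a 16-byte boundary.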

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 08:48:24 AM

Quote from: jj2007 on August 25, 2010, 06:30:46 AM
Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago (http://www.masm32.com/board/index.php?topic=6576.0). The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

Last post of that thread:
Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster

1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).

I have a question about RtlZeroMemory: can we call it in some way so that it
fills the buffer with a character of our choice, or does it only zero the area? Does it take a fill parameter at all?

By the way, the SSE2 solution you posted looks much faster, so why not use it on modern
machines? :P

Thanks

Title: Re: The fastest way to clear a buffer
Post by: hutch-- on August 25, 2010, 08:52:18 AM

Frank,

have a play with REP STOSD, apart from SSE you will struggle to do much better.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 09:12:52 AM

Quote from: hutch-- on August 25, 2010, 08:52:18 AM
Frank,

have a play with REP STOSD, apart from SSE you will struggle to do much better.

I'll certainly play with it a little, and with some SSE as well afterwards. My machine is able to
do so many things I don't even suspect :P

Title: Re: The fastest way to clear a buffer
Post by: Rockoon on August 25, 2010, 01:39:53 PM

AMD Phenom(tm) II X6 1055T Processor (SSE3)
557 cycles for RtlZeroMemory
2012 cycles for FrkTons
549 cycles for rep stosd
1509 cycles for movdqa
1509 cycles for movaps

556 cycles for RtlZeroMemory
3014 cycles for FrkTons
549 cycles for rep stosd
1016 cycles for movdqa
1015 cycles for movaps

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 25, 2010, 01:45:39 PM

Really surprising, Rockoon. There seem to be huge differences in the way rep stosd is implemented.

Code:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
2515    cycles for RtlZeroMemory
4300    cycles for FrkTons
2486    cycles for rep stosd
2491    cycles for movdqa
2387    cycles for movaps

Title: Re: The fastest way to clear a buffer
Post by: hutch-- on August 25, 2010, 01:49:26 PM

Code:


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1055    cycles for RtlZeroMemory
2018    cycles for FrkTons
1047    cycles for rep stosd
531     cycles for movdqa
531     cycles for movaps

1055    cycles for RtlZeroMemory
2026    cycles for FrkTons
1048    cycles for rep stosd
521     cycles for movdqa
519     cycles for movaps


--- ok ---

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 25, 2010, 03:36:27 PM

REP STOSD is simple enough
i have to ask, though
why do you want to clear the char buffer ? - lol
won't it be filled in by the next read/fill operation ?

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 03:43:41 PM

Quote from: dedndave on August 25, 2010, 03:36:27 PM
REP STOSD is simple enough
i have to ask, though
why do you want to clear the char buffer ? - lol
won't it be filled in by the next read/fill operation ?

Yes sir, it'll be filled with the next operation, but not my curiosity :P

And while we are here, I tried to use some SSE mnemonics to do something
different, because the ability to use 16-byte registers appeals to me a lot, but those
nasty little endians drive me crazy:

Code:


;----------------------------------------------------------------------
; Fast way for reversing a 16 bytes string with SSE instructions.
;----------------------------------------------------------------------
; Author: frktons @ MASM32 forum
; Date: 25/aug/2010.
;----------------------------------------------------------------------


include \masm32\include\masm32rt.inc
.686
.xmm

;----------------------------------------------------------------------

.data
align 16

    str1       db   "0123456789ABCDEF",0  ; original string
    ptr_str1   dd   str1                  ; pointer to the string

align 16

    str2       db   "                ",0  ; reversed string
    ptr_str2   dd   str2                  ; pointer to reversed string

    imm8       db   27 ; bit pattern 00011011 used by pshufd to reverse
                       ; the order of the 4 DW of an xmm register


;----------------------------------------------------------------------


.data?

    rHnd      HANDLE ?

    howmany   dd ?
    buffer    INPUT_RECORD <>    
   

.code

start:

Main PROC

    INVOKE GetStdHandle, STD_INPUT_HANDLE
    mov rHnd,eax

    print "original string: "
    print ptr_str1,13,10,13,10

    CALL  rev_sse2              
    
    print "reversed string: "
    print ptr_str2,13,10,13,10
    
    CALL AnyKey

finish: INVOKE ExitProcess,0

    ret

Main ENDP

; -------------------------------------------------------------------------   

rev_sse2 PROC 

    mov eax, ptr_str1
    mov ebx, ptr_str2
    
    movdqa   xmm0, [eax]
    pshufd   xmm1, xmm0, 27
    movdqa   [ebx], xmm1

    ret

rev_sse2 ENDP

; -------------------------------------------------------------------------
;Returns: key code in buffer.KeyEvent.wVirtualKeyCode WORD size
; -------------------------------------------------------------------------

AnyKey PROC

again: 

    INVOKE ReadConsoleInput,rHnd,offset buffer,1,offset howmany
    cmp buffer.EventType,KEY_EVENT
    jnz again

    cmp buffer.KeyEvent.bKeyDown,0
    jz again

    ret

AnyKey ENDP

; -------------------------------------------------------------------------

end start

This does not give me what I want, the reversed string, but something a bit
different:

Code:


original string: 0123456789ABCDEF

reversed string: CDEF89AB45670123

Aren't those little endians nasty enough?
Or is it my n00b-iness that is big [endian] enough? :lol
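
Endianness is probably not the culprit here. pshufd with the immediate 27 (00 01 10 11 in binary) reorders the four dwords of the register but leaves the bytes inside each dword in place, which matches the output shown. The following C sketch models that on a plain byte array; the function names are hypothetical and no SSE instruction is actually executed.

```c
#include <assert.h>
#include <string.h>

/* Model of pshufd xmm1, xmm0, 27 on a 16-byte block:
   destination dword i takes source dword 3 - i.
   Bytes inside each dword keep their order. */
void reverse_dwords16(const char *src, char *dst)
{
    assert(src != dst);                 /* not written for in-place use */
    for (int i = 0; i < 4; ++i)
        memcpy(dst + 4 * i, src + 4 * (3 - i), 4);
}

/* Full byte reversal: destination byte i takes source byte 15 - i. */
void reverse_bytes16(const char *src, char *dst)
{
    assert(src != dst);                 /* not written for in-place use */
    for (int i = 0; i < 16; ++i)
        dst[i] = src[15 - i];
}
```

For "0123456789ABCDEF" the first function yields "CDEF89AB45670123" and the second "FEDCBA9876543210", so a full reversal also has to swap the bytes within each dword.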

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 25, 2010, 03:46:31 PM

i don't think SSE gives you a way to reverse bytes in a dword
the BSWAP instruction does that, though

Code:

;EAX = 12345678h
bswap eax
;EAX = 78563412h

if you want to swap nybbles, that's another story
a while back, we were playing with reversing all the bits in a dword register
there was a rather interesting algo for that

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 03:48:58 PM

Quote from: dedndave on August 25, 2010, 03:46:31 PM
i don't think SSE gives you a way to reverse bytes in a dword
the BSWAP instruction does that, though

Yes Master, I remember the old lesson about bswap that you and
Jochen gave me some time ago. I was just experimenting with what
SSE mnemonics offer. Maybe there is even a way to reverse the whole thing with SSE,
but I don't actually know ::)

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 25, 2010, 05:28:18 PM

You can reverse 16 bytes with a single instruction called pshufb, but it's SSSE3.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 05:35:27 PM

Quote from: jj2007 on August 25, 2010, 05:28:18 PM
You can reverse 16 bytes with a single instruction called pshufb, but it's SSSE3.

Thanks Jochen. I'll wait until the next CPU then. :P

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 08:15:47 PM

I started a new thread in the 64-bit section because
rep stosd was considered the fastest way to initialize a block of
memory in 32-bit assembly.
Now SSE instructions beat it, on Intel machines at least.
In my opinion, on 64-bit machines working with native 64-bit operations,
we could get better results than with SSE mnemonics just by using the general-purpose 64-bit registers.

To prove it I need the rep stosd version translated into 64-bit assembly
and tested.

Anyone wants to engage?

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 25, 2010, 08:30:53 PM

Quote from: frktons on August 25, 2010, 08:15:47 PM
Anyone wants to engage?

What about you? I'll give you a starting point:

Code:

    mov rax, 20202020202020202020202020202020h
    mov rdi, offset buffer
    mov rcx, 1000
    rep stosd

I can't test it because my OS and CPU are 32 bit. Now don't be shy, just go ahead!

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 25, 2010, 08:39:47 PM

it's not REP STOSQ ???

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 08:58:46 PM

Quote from: jj2007 on August 25, 2010, 08:30:53 PM
Quote from: frktons on August 25, 2010, 08:15:47 PM
Anyone wants to engage?

What about you? I'll give you a starting point:
Code:
    mov rax, 20202020202020202020202020202020h
    mov rdi, offset buffer
    mov rcx, 1000
    rep stosd

I can't test it because my OS and CPU are 32 bit. Now don't be shy, just go ahead!

Thanks Jochen. I'll gladly try it if you tell me how to compile it.
Is MASM32 enough or have I to use any other tool?
And a last question:

Code:


.686
.xmm

Are these enough, or do I have to specify something else?

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 25, 2010, 09:16:49 PM

Quote from: frktons on August 25, 2010, 08:58:46 PM
Is MASM32 enough or have I to use any other tool?

JWasm (http://www.japheth.de/JWasm/Win64_1.html) is the best option, but I can't tell you more since my OS is 32 bit.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 25, 2010, 09:23:51 PM

Well, I asked in 64 bit subforum because I'm not ready for 64 bit assembling.
My OS is 64 bit and my machine too, but I know too little to do it myself, and
I don't even have a clue on how to use GoAsm or JWasm or ML64. ::)

By the way, instead of RtlZeroMemory it's probably better to use this MACRO from Microsoft
to accomplish the task of filling a block of memory:

Code:


void FillMemory(
  [out]  PVOID Destination,
  [in]   SIZE_T Length,
  [in]   BYTE Fill
);

What do you think?

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 25, 2010, 11:29:46 PM

The FillMemory macro calls the RtlFillMemory function.

You will find the following in windows.inc:

Code:


FillMemory EQU RtlFillMemory

I would just call RtlFillMemory instead of FillMemory just to keep things straightforward.

RtlZeroMemory fills memory with zeros, RtlFillMemory is for filling memory with other characters.
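
As a portable illustration of the difference, here is a minimal C sketch. It does not call the Windows functions; `fill_memory` and `zero_memory` are hypothetical stand-ins built on the C library's memset, using the argument order of the FillMemory prototype quoted earlier in this thread (destination, length, fill byte).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with the FillMemory argument order:
   destination, length in bytes, fill byte. */
void fill_memory(void *dest, size_t length, unsigned char fill)
{
    assert(dest != NULL);
    memset(dest, fill, length);   /* note: memset takes (dest, value, length) */
}

/* Hypothetical stand-in for the zero-fill case: the same operation
   with the fill byte fixed at 0. */
void zero_memory(void *dest, size_t length)
{
    fill_memory(dest, length, 0);
}
```

With these, `fill_memory(buf, 8000, ' ')` gives a buffer of spaces, while `zero_memory(buf, 8000)` can only produce zeros. The swapped order of the last two arguments relative to memset is easy to get wrong.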

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 25, 2010, 11:38:26 PM

i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 12:49:20 AM

Quote from: Greg Lyon on August 25, 2010, 11:29:46 PM
The FillMemory macro calls the RtlFillMemory function.

You will find the following in windows.inc:

Code:
FillMemory EQU RtlFillMemory

I would just call RtlFillMemory instead of FillMemory just to keep things straightforward.

RtlZeroMemory fills memory with zeros, RtlFillMemory is for filling memory with other characters.

Thanks Greg, this is what I meant. There is this MACRO from Microsoft that is quite efficient, and
is an implementation of rep stosd

Quote from: dedndave on August 25, 2010, 11:38:26 PM
i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?

You are right, Master. Only one thing to meditate upon: REP STOSQ is only implemented
inside VISUAL C++, and

Quote
This routine is only available as an intrinsic

And in Assembly as INTEL says:

Quote
In 64-bit mode, the default address size is 64 bits, 32-bit address size is supported using the prefix 67H. Using a REX prefix in the form of REX.W promotes operation on doubleword operand to 64 bits. The promoted no-operand mnemonic is STOSQ. STOSQ (and its explicit operands variant) store a quadword from the RAX register into the destination addressed by RDI or EDI.

That is quite obscure for a "premium n00b" of my level :P

My results with RtlFillMemory added to the testbed:

Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1057 cycles for RtlZeroMemory
2017 cycles for FrkTons
1050 cycles for rep stosd
568 cycles for movdqa
556 cycles for movaps
1098 cycles for RtlFillMemory

1078 cycles for RtlZeroMemory
2047 cycles for FrkTons
1052 cycles for rep stosd
555 cycles for movdqa
548 cycles for movaps
1111 cycles for RtlFillMemory

--- ok ---

In the "sloppy category" my routine beats anyone else's. :P

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 01:21:33 AM

give yourself some credit Frank - lol
STOSQ stores a qword (64 bit value)
REP STOSQ is probably fast as hell on 64-bit machines
Jochen showed REP STOSD in his 64-bit example code - that was probably just a small oversight
i am sure he meant REP STOSQ
RtlZeroMemory probably preserves EDI (or RDI), but other than that, it is straightforward
you can assume that the direction flag is clear, as it should be - i am sure RtlZeroMemory also makes that assumption

so....
load the value you want repeated into EAX/RAX
load the repeat count into ECX/RCX
load the address into EDI/RDI
then do the REP STOSD or REP STOSQ

it will be quite fast as long as the address is 4-byte-aligned for 32-bit code or 8-byte-aligned for 64-bit code

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 01:23:23 AM

Quote from: dedndave
i think i would use REP STOSD/Q
or have we forgotten how to write ASM ?

Dave,

I knew someone was going to say something like that. I was only commenting on the use of FillMemory, not on what was fastest to do the job.

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 01:25:53 AM

you're right, of course, Greg
many of these functions were written for C-programmers :lol

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 01:40:47 AM

Dave,

So are you saying I don't know how to write ASM code?

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 01:46:49 AM

no - no - not at all, Greg - lol
you're tops :U

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 01:48:52 AM

Dave,

No, I'm definitely not tops. Smartass.

Title: Re: The fastest way to clear a buffer
Post by: ecube on August 26, 2010, 01:55:32 AM

Quote from: Greg Lyon on August 26, 2010, 01:48:52 AM
Dave,

No, I'm definitely not tops. Smartass.

heh you must not be aware of Dave's humor, shame.

Also to the OP, if you're just trying to "clear" a buffer to use with an ascii string you can

Code:


lea edi,buffer
mov byte ptr [edi],0

and lstrcpy etc.. will consider it empty.
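
The same idea in C, as a minimal sketch with a hypothetical helper name: writing a terminator at the first byte makes string functions treat the buffer as empty, but the remaining bytes keep their old contents, so this is not a substitute for filling the buffer with spaces.

```c
#include <assert.h>
#include <string.h>

/* "Clear" a C string by writing a terminator at its first byte.
   Only byte 0 changes; the rest of the buffer is left untouched. */
void clear_string(char *buf)
{
    assert(buf != NULL);
    buf[0] = '\0';
}
```

After `clear_string(buf)`, `strlen(buf)` is 0 even though the old characters are still present from index 1 onward.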

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 01:58:43 AM

Quote from: E^cube
heh you must not be aware of Dave's humor, shame.

Oh, I'm fully aware of it.

Title: Re: The fastest way to clear a buffer
Post by: ecube on August 26, 2010, 02:03:32 AM

Ahh I remember you, you were insulting me in PM :tdown I see you're playing nice with others too... :snooty:

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 02:06:48 AM

i meant that, Greg - i wish i knew all the stuff you guys know
i might learn some of it, if i had more time, too
i had planned on spending the entire summer learning more about win32 code
as it turned out, i didn't get to spend hardly any time on code
this weekend, it looks like i am off to Michigan to remodel a house, too - lol
by the time the day is done, i will be too tired to concentrate on learning anything new
maybe by the time christmas rolls around......

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 02:09:04 AM

Quote from: E^cube on August 26, 2010, 01:55:32 AM

To the OP, if you're just trying to "clear" a buffer to use with an ascii string you can

Code:
lea edi,buffer
mov byte ptr [edi],0

and lstrcpy etc.. will consider it empty.

I'm trying to fill a buffer with spaces [ASCII 32]. If somebody can translate
into 64-bit assembly and compile the equivalent of:

Code:


			push edi
			mov ecx, 2000
			mov edi, offset Dest
			mov eax, 20202020h
			rep stosd
			pop edi

I'll have a more detailed idea of how 64 bit native registers compare to
SSE instructions. My machine and OS are 64 bit, I'm not yet. :'(

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 02:10:41 AM

E^cube,

No, you have that backwards E^cube, it was you sending me nasty PMs. And you got a chance to take a jab at me and you sure took advantage of it didn't you?

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 02:13:28 AM

Dave,

No problem Dave, I guess I'm just a little on edge tonight, I'm sorry.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 02:21:08 AM

Quote from: dedndave on August 26, 2010, 01:21:33 AM
give yourself some credit Frank - lol
STOSQ stores a qword (64 bit value)
REP STOSQ is probably fast as hell on 64-bit machines
Jochen showed REP STOSD in his 64-bit example code - that was probably just a small oversight
i am sure he meant REP STOSQ
RtlZeroMemory probably preserves EDI (or RDI), but other than that, it is straightforward
you can assume that the direction flag is clear, as it should be - i am sure RtlZeroMemory also makes that assumption

so....
load the value you want repeated into EAX/RAX
load the repeat count into ECX/RCX
load the address into EDI/RDI
then do the REP STOSD or REP STOSQ

it will be quite fast as long as the address is 4-byte-aligned for 32-bit code or 8-byte-aligned for 64-bit code

Well, Dave I got the 32 bit version of rep stosd, it is not that difficult.
I'm trying to see in 64 bit how it works, and I don't have a clue on how to do it
because I have no 64 bit experience and tools at all. ::)

So I'm asking somebody to translate and compile in 64 bit mode to test the
performance it gets.

Title: Re: The fastest way to clear a buffer
Post by: GregL on August 26, 2010, 02:25:52 AM

frktons,

I don't think anyone has written timing routines in x64 yet.

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 02:29:09 AM

Quote from: Greg Lyon on August 26, 2010, 02:25:52 AM
frktons,

I don't think anyone has written timing routines in x64 yet.

Time to start? Well, I'd be satisfied just seeing how it translates into 64-bit.
It shouldn't be too complex:

Code:


			push edi
			mov ecx, 2000
			mov edi, offset Dest
			mov eax, 20202020h
			rep stosd
			pop edi

::)

Jochen suggested starting from:

Code:


    mov rax, 20202020202020202020202020202020h
    mov rdi, offset Dest
    mov rcx, 1000
    rep stosq

And it is quite clear, but what about all the rest of the program?
And moreover is this a legal 64 bit syntax that a 64 bit assembler can assemble?
Which one? JWasm, GoAsm, ML64?

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 02:35:11 AM

Code:

        push    rdi
        mov     rcx,1000
        mov     rdi,offset Dest
        mov     rax,2020202020202020h
        rep     stosq
        pop     rdi

i have no way to test it :P
JwAsm will assemble it for you
you could clear a larger area using 32-bit and 64-bit, then use a stopwatch :lol

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 02:37:02 AM

Quote from: dedndave on August 26, 2010, 02:35:11 AM
Code:
        push    rdi
        mov     rcx,1000
        mov     rdi,offset Dest
        mov     rax,2020202020202020h
        rep     stosq
        pop     rdi
i have no way to test it :P
JwAsm will assemble it for you

oh, I was writing while you replied. Have a look at my prev post Master.

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 02:38:13 AM

i think Jochen has a few too many spaces in there, Frank :bg
16 spaces = 128 bits in a 64-bit reg
overflow !!! - lol
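
The size arithmetic can be checked with a short C sketch: a 64-bit register holds 8 bytes, so the fill pattern is 8 spaces (16 hex digits), and 8000 bytes take 1000 qword stores. This only models REP STOSQ in portable C; it is not 64-bit assembly and was not timed, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a 64-bit fill pattern by repeating one byte 8 times.
   For 0x20 this gives 0x2020202020202020: 16 hex digits, 8 bytes. */
uint64_t repeat_byte64(unsigned char b)
{
    return (uint64_t)b * UINT64_C(0x0101010101010101);
}

/* Model of REP STOSQ: one 8-byte store per repetition.
   Returns the repeat count, i.e. the value RCX would need. */
size_t fill_qwords(void *dst, size_t bytes, uint64_t pattern)
{
    assert(bytes % 8 == 0);           /* whole qwords only in this sketch */
    size_t count = bytes / 8;         /* 8000 bytes -> 1000 qwords */
    unsigned char *p = dst;
    for (size_t i = 0; i < count; ++i)
        memcpy(p + 8 * i, &pattern, sizeof pattern);
    return count;
}
```

A 32-hex-digit constant would need 16 bytes, which is the width of an xmm register, not of rax.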

Title: Re: The fastest way to clear a buffer
Post by: frktons on August 26, 2010, 02:41:43 AM

Quote from: dedndave on August 26, 2010, 02:38:13 AM
i think Jochen has a few too many spaces in there, Frank :bg
16 spaces = 128 bits in a 64-bit reg
overflow !!! - lol

You are right, those were 128-bit xmm registers he used with SSE mnemonics, and he probably
forgot we were moving to the native 64-bit rxx registers.
Me too :P

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 26, 2010, 06:31:16 AM

2020202020202020h

Folks, if I remember well, Hutch knows a revolutionary mathematical technology to count the spaces in this expression :bg

Title: Re: The fastest way to clear a buffer
Post by: hutch-- on August 26, 2010, 07:06:51 AM

:bg

Huh ?

In my own case "revolutionary" and "mathematical" are not compatible in the same sentence. I freely admit to "Eenie meanie minie moe" technology (fingers) and have to cheat and use computers to add up numbers.

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 26, 2010, 08:16:23 AM

Quote from: hutch-- on August 26, 2010, 07:06:51 AM
:bg

Huh ?

In my own case "revolutionary" and "mathematical" are not compatible in the same sentence. I freely admit to "Eenie meanie minie moe" technology (fingers) and have to cheat and use computers to add up numbers.

I meant the "Eenie meanie minie moe" technology. It is quite sufficient to see that there are 8 spaces in 2020202020202020h, not 16 as Dave suspected :wink

Title: Re: The fastest way to clear a buffer
Post by: hutch-- on August 26, 2010, 09:57:58 AM

:bg

Dave is probably cheating and using both hands.

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 12:13:25 PM

Quote from: jj2007 on August 25, 2010, 08:30:53 PM
What about you? I'll give you a starting point:
Code Select

mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd

i'd say that either Jochen has had one too many cappuccinos or his space bar is stuck :lol
note: it should also be STOSQ - not STOSD

http://www.masm32.com/board/index.php?topic=14685.msg119244#msg119244

i cheated and used both hands to create this post

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on August 26, 2010, 04:15:55 PM

Quote from: dedndave on August 26, 2010, 12:13:25 PM
Quote from: jj2007 on August 25, 2010, 08:30:53 PM
What about you? I'll give you a starting point:
Code Select

mov rax, 20202020202020202020202020202020h
mov rdi, offset buffer
mov rcx, 1000
rep stosd

i'd say that either Jochen has had one too many cappuccinos or his space bar is stuck :lol
note: it should also be STOSQ - not STOSD

Oops, you are right! I had looked at replies 51 & 52 and saw exactly 8 spaces in the code... but that was your code, not mine :red
Apologies :thumbu
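
The finger counting above can also be done mechanically. Here is a small Python sketch (the helper names are made up for illustration, they are not from any post) that counts the bytes in each hex immediate and works out how many bytes a `rep stos` of a given width writes. It suggests why the 32-digit constant cannot fit in rax, and why `rep stosd` with a count of 1000 would cover only half of the 8,000-byte buffer:

```python
def immediate_bytes(hex_literal: str) -> int:
    """Number of bytes encoded by a MASM-style hex literal such as '20202020h'."""
    digits = hex_literal.lower().rstrip("h")
    return (len(digits) + 1) // 2          # two hex digits per byte


def rep_stos_bytes(count: int, width_bytes: int) -> int:
    """Bytes written by REP STOSx for a given count and element width."""
    return count * width_bytes


dave = "20" * 8 + "h"      # 2020202020202020h, the constant in Dave's code
jochen = "20" * 16 + "h"   # the 32-digit constant in Jochen's starting point

print(immediate_bytes(dave))       # 8  -> fits a 64-bit register
print(immediate_bytes(jochen))     # 16 -> would need 128 bits
print(rep_stos_bytes(1000, 8))     # 8000 (rep stosq, rcx = 1000)
print(rep_stos_bytes(1000, 4))     # 4000 (rep stosd, same count)
```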

Title: Re: The fastest way to clear a buffer
Post by: dedndave on August 26, 2010, 06:55:37 PM

no prob JJ :P
it feels good to catch you, once in a while

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 07:21:53 PM

After some experimentation I got these results:

Code Select


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
2059    cycles for RtlZeroMemory
4023    cycles for FrkTons
2070    cycles for rep stosd
1062    cycles for movdqa
1062    cycles for movaps
1024    cycles for FrkTons New
5023    cycles for movups
5050    cycles for movupd

2087    cycles for RtlZeroMemory
4043    cycles for FrkTons
2064    cycles for rep stosd
1038    cycles for movdqa
1050    cycles for movaps
1016    cycles for FrkTons New
5036    cycles for movups
5042    cycles for movupd


--- ok ---

How can it be possible?
The new test attached.

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on September 03, 2010, 08:44:08 PM

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2293    cycles for RtlZeroMemory
4029    cycles for FrkTons
2272    cycles for rep stosd
2017    cycles for movdqa
2018    cycles for movaps
2140    cycles for FrkTons New
6026    cycles for movups
6021    cycles for movupd

Can't see any surprises in here ::)

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 09:25:34 PM

Quote from: jj2007 on September 03, 2010, 08:44:08 PM
Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2293    cycles for RtlZeroMemory
4029    cycles for FrkTons
2272    cycles for rep stosd
2017    cycles for movdqa
2018    cycles for movaps
2140    cycles for FrkTons New
6026    cycles for movups
6021    cycles for movupd

Can't see any surprises in here ::)

I should have imagined that I did something wrong :P

Maybe if you use an older CPU the program wouldn't even run :lol

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 10:46:50 PM

Hi, Frank!

This is results on my CPU:

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
5144    cycles for RtlZeroMemory
8253    cycles for FrkTons
4952    cycles for rep stosd
4790    cycles for movdqa
4801    cycles for movaps
4862    cycles for FrkTons New
10594   cycles for movups
10601   cycles for movupd

4960    cycles for RtlZeroMemory
8383    cycles for FrkTons
4952    cycles for rep stosd
4820    cycles for movdqa
4795    cycles for movaps
4861    cycles for FrkTons New
10598   cycles for movups
10602   cycles for movupd

Alex

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on September 03, 2010, 10:50:43 PM

Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run :lol

You would need a very old CPU.

On mine, you can save 40 cycles with a small modification:

Code Select

			mov edx, offset Dest
			lea ecx, [edx+16000]
			mov eax, 20202020h
			movd xmm0, eax
			pshufd xmm0, xmm0, 0
;                  movdqa xmm1, xmm0 
;                  movdqa xmm2, xmm0 
;                  movdqa xmm3, xmm0 
;                  movdqa xmm4, xmm0                                      

		@@:
			movdqa [edx], xmm0
			movdqa [edx + 16], xmm0
			movdqa [edx + 32], xmm0
			movdqa [edx + 48], xmm0
			movdqa [edx + 64], xmm0
			add edx, 80
			cmp edx, ecx
			jl @B
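
A quick way to check the loop bounds of an unrolled fill like this one is to replay its address arithmetic. The Python sketch below is only a model (the function name is hypothetical): it adds the 80-byte stride per pass and tests at the bottom of the loop, as the code above does. With 16000 bytes the loop ends exactly on the buffer end; a size that is not a multiple of 80 would write past it.

```python
def unrolled_fill(total_bytes: int, stride: int = 80):
    """Model the loop: write `stride` bytes per pass, test at the bottom."""
    offset = 0
    passes = 0
    while True:
        offset += stride               # five 16-byte stores, then add edx, 80
        passes += 1
        if not offset < total_bytes:   # cmp edx, ecx / loop while below the end
            break
    return passes, offset              # offset is the number of bytes written


print(unrolled_fill(16000))   # (200, 16000): ends exactly at Dest+16000
print(unrolled_fill(8016))    # (101, 8080): 64 bytes past the end
```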

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 10:54:27 PM

And you can save some bytes, if use MOVAPS for moving to regs and to memory :)

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 11:04:44 PM

Quote from: jj2007 on September 03, 2010, 10:50:43 PM
Quote from: frktons on September 03, 2010, 09:25:34 PM
Maybe if you use an older CPU the program wouldn't even run :lol

You would need a very old CPU.

On mine, you can save 40 cycles with a small modification:
Code Select

	mov edx, offset Dest
	lea ecx, [edx+16000]
	mov eax, 20202020h
	movd xmm0, eax
	pshufd xmm0, xmm0, 0
;	movdqa xmm1, xmm0
;	movdqa xmm2, xmm0
;	movdqa xmm3, xmm0
;	movdqa xmm4, xmm0

@@:
	movdqa [edx], xmm0
	movdqa [edx + 16], xmm0
	movdqa [edx + 32], xmm0
	movdqa [edx + 48], xmm0
	movdqa [edx + 64], xmm0
	add edx, 80
	cmp edx, ecx
	jl @B

I already tested this kind of unrolling, but the best performance on the Core 2 Duo
comes from using 5 different XMM registers. The CPU architecture plays the biggest
role in the 20-50 cycle difference. Not that much anyway. In my opinion it's
just the cache memory that gives some extra speed on Core 2. I'd like to see
what these routines gain or lose on the more recent quad/i3-i7 machines as well.

If anyone has one of these newest CPUs.

Frank

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 11:06:20 PM

Quote from: Antariy on September 03, 2010, 10:54:27 PM
And you can save some bytes, if use MOVAPS for moving to regs and to memory :)

Alex

Yes Alex. I'm testing just the speed and MOVDQA looks a little bit faster than MOVAPS.
It's a very tiny difference indeed. At least on my machine.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 11:07:59 PM

Frank, on i3-i7 "rep stosd" should run faster than, or at least as fast as, the SSE code. I cannot check this.

What about writing to memory with "MOVNTDQ", a non-temporal write that bypasses the cache?

Edited:
timings with direct writing:

Code Select


13705   cycles for FrkTons New with MOVNTDQ
13819   cycles for FrkTons New with MOVNTPD
13787   cycles for FrkTons New with MOVNTPS

Other timings omitted, because I have posted them already.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 11:15:58 PM

Quote from: Antariy on September 03, 2010, 11:07:59 PM
Frank, on i3-i7 "rep stosd" should run faster than, or at least as fast as, the SSE code. I cannot check this.

What about writing to memory with "MOVNTDQ", a non-temporal write that bypasses the cache?

Alex

Alex MOVNTDQ is quite slow on my machine:

Code Select


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
2059    cycles for RtlZeroMemory
4029    cycles for FrkTons
2047    cycles for rep stosd
1033    cycles for movdqa
1032    cycles for movaps
1017    cycles for FrkTons New
5024    cycles for movups
5020    cycles for movupd

8047    cycles for MOVNTDQ

2090    cycles for RtlZeroMemory
4046    cycles for FrkTons
2062    cycles for rep stosd
1052    cycles for movdqa
1047    cycles for movaps
1017    cycles for FrkTons New
5036    cycles for movups
5042    cycles for movupd

7864    cycles for MOVNTDQ


--- ok ---

I suppose that if you could do
rep/stosq with a 64-bit rxx register, the results would be similar to or better
than the SSE2 instructions. But nobody has taken on the task of assembling it with a 64-bit
assembler. I'll probably do it when I'm more familiar with JWASM.
For the time being I can't find the time to study that as well :P

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 11:20:37 PM

Yes, I have posted my results for the non-temporal writes as well (in the post that asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 11:34:53 PM

Quote from: Antariy on September 03, 2010, 11:20:37 PM
Yes, I have posted my results for the non-temporal writes as well (in the post that asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex

I'm trying with 16MB, but the program is taking a lot of time to compile ::)
Will it ever end?

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 11:46:07 PM

Quote from: frktons on September 03, 2010, 11:34:53 PM
Quote from: Antariy on September 03, 2010, 11:20:37 PM
Yes, I have posted my results for the non-temporal writes as well (in the post that asked about them).

MOVNTxxx has an advantage when you need to make *huge* writes to memory, much larger than the cache.
Otherwise, "normal" writes are faster (an 8KB buffer is not big for today's caches).
Try writing to a 32MB buffer, for example.
Alex

I'm trying with 16MB, but the program is taking a lot of time to compile ::)
Will it ever end?

This is a known problem with allocating large data inside the exe, maybe a "bug" of MASM. Try JWasm.
Or try allocating the buffer with the heap functions (GlobalAlloc etc.). Allocate a 32MB or 16MB buffer on the heap, and use it in the tests.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 03, 2010, 11:52:45 PM

Quote from: Antariy on September 03, 2010, 11:46:07 PM

This is a known problem with allocating large data inside the exe, maybe a "bug" of MASM. Try JWasm.
Or try allocating the buffer with the heap functions (GlobalAlloc etc.). Allocate a 32MB or 16MB buffer on the heap, and use it in the tests.

Alex

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 03, 2010, 11:57:37 PM

Quote from: frktons on September 03, 2010, 11:52:45 PM

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

For 16MB buffer

Code Select


invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

The pointer to the buffer is returned in eax. Save it in a variable, and in the testing code load the pointer from that variable into a register before using it.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 04, 2010, 12:05:27 AM

Quote from: Antariy on September 03, 2010, 11:57:37 PM
Quote from: frktons on September 03, 2010, 11:52:45 PM

All new things for me. I'm a little tired at the moment, I'll try tomorrow after studying something about GlobalAlloc etc,
and after resting for a while. It's night here.

Frank

For 16MB buffer

Code Select

invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

The pointer to the buffer is returned in eax. Save it in a variable, and in the testing code load the pointer from that variable into a register before using it.

Alex

Alex, is this enough or have I to change something else?

Code Select


.data?
align 16
; Dest	db 16000000 dup(?) ; <------ don't use it anymore
DataPtr  dd ? ; <-------------- Pointer for data allocated

.code
start:
     push 1
     call ShowCpu				; print brand string and SSE level

      invoke GlobalAlloc,GMEM_ZEROINIT or GMEM_FIXED,16*1024*1024

      mov DataPtr, eax  
      
	REPEAT 2
		invoke Sleep, 100
		counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
			invoke RtlZeroMemory, DataPtr, 16000000 ; <----------------- is this use of DataPtr correct?
		counter_end
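
Two details in the snippet above are easy to check with plain arithmetic. The allocation asks for 16*1024*1024 bytes while the clear call passes 16000000, and, as far as I know, GlobalAlloc is documented to guarantee only 8-byte alignment, whereas movdqa/movaps stores need a 16-byte aligned address. The Python sketch below works through both; `align_up` and the example pointer value are hypothetical, only there to show the rounding:

```python
def align_up(address: int, alignment: int = 16) -> int:
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


allocated = 16 * 1024 * 1024    # size passed to GlobalAlloc
cleared = 16000000              # size passed to RtlZeroMemory

print(allocated)                # 16777216
print(allocated - cleared)      # 777216 bytes at the end are never touched

example_ptr = 0x00A30008        # hypothetical return value, 8-byte aligned only
print(hex(align_up(example_ptr)))   # 0xa30010
```

If the pointer is rounded up like this, the allocation should probably request 16 extra bytes so the aligned buffer still has the full size.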

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 04, 2010, 12:11:29 AM

Alex you were right, with big buffer I have these results:

Code Select


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11628116        cycles for RtlZeroMemory
15883968        cycles for FrkTons
10996290        cycles for rep stosd
15473203        cycles for movdqa
15480211        cycles for movaps
15471071        cycles for FrkTons New
15477872        cycles for movups
15445525        cycles for movupd

8082000 cycles for MOVNTDQ

10999930        cycles for RtlZeroMemory
15870714        cycles for FrkTons
11012185        cycles for rep stosd
15418317        cycles for movdqa
15427633        cycles for movaps
15416041        cycles for FrkTons New
15418995        cycles for movups
15415995        cycles for movupd

8162844 cycles for MOVNTDQ


--- ok ---

and rep/stosd is faster than the cached SSE2 stores (only MOVNTDQ beats it).
I reduced the number of test loops from 1 million to 1,000
to make the run shorter.

Frank
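
To make the big-buffer numbers easier to compare, they can be turned into bytes per cycle and sorted. This is only a sketch in Python: the cycle counts are the first run of the Core 2 results above, and the buffer size of 16,000,000 bytes is an assumption taken from the RtlZeroMemory call shown earlier (the allocation itself was 16*1024*1024).

```python
BUFFER_BYTES = 16_000_000   # assumption: size used in the RtlZeroMemory call

# First run of the Core 2 6600 results (cycles per fill of the big buffer).
cycles = {
    "RtlZeroMemory": 11628116,
    "FrkTons": 15883968,
    "rep stosd": 10996290,
    "movdqa": 15473203,
    "movaps": 15480211,
    "FrkTons New": 15471071,
    "movups": 15477872,
    "movupd": 15445525,
    "MOVNTDQ": 8082000,
}


def bytes_per_cycle(cycle_count: int, size: int = BUFFER_BYTES) -> float:
    """Average throughput of one fill, in bytes per clock cycle."""
    return size / cycle_count


ranking = sorted(cycles, key=cycles.get)   # fewest cycles first
for name in ranking:
    print(f"{name:14s} {bytes_per_cycle(cycles[name]):5.2f} bytes/cycle")
```

On these numbers MOVNTDQ comes out first and rep stosd second, with all the cached SSE2 variants bunched together behind RtlZeroMemory.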

Title: Re: The fastest way to clear a buffer
Post by: clive on September 04, 2010, 12:38:21 AM

Absent a newer build here's the result from the last one

Code Select

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6354    cycles for RtlZeroMemory
10341   cycles for FrkTons
6335    cycles for rep stosd
4143    cycles for movdqa
4108    cycles for movaps
1667    cycles for FrkTons New
8220    cycles for movups
8340    cycles for movupd

6296    cycles for RtlZeroMemory
10333   cycles for FrkTons
6265    cycles for rep stosd
4117    cycles for movdqa
4153    cycles for movaps
1675    cycles for FrkTons New
8227    cycles for movups
8232    cycles for movupd

Core Solo

Code Select

Genuine Intel(R) CPU           T1350  @ 1.86GHz (SSE3)
2319    cycles for RtlZeroMemory
4072    cycles for FrkTons
2305    cycles for rep stosd
2039    cycles for movdqa
2039    cycles for movaps
2155    cycles for FrkTons New
6098    cycles for movups
6088    cycles for movupd

2315    cycles for RtlZeroMemory
4082    cycles for FrkTons
2296    cycles for rep stosd
2038    cycles for movdqa
2038    cycles for movaps
2164    cycles for FrkTons New
6087    cycles for movups
6095    cycles for movupd

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 04, 2010, 11:38:30 AM

Quote from: clive on September 04, 2010, 12:38:21 AM
Absent a newer build here's the result from the last one

Code Select

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6354    cycles for RtlZeroMemory
10341   cycles for FrkTons
6335    cycles for rep stosd
4143    cycles for movdqa
4108    cycles for movaps
1667    cycles for FrkTons New
8220    cycles for movups
8340    cycles for movupd

6296    cycles for RtlZeroMemory
10333   cycles for FrkTons
6265    cycles for rep stosd
4117    cycles for movdqa
4153    cycles for movaps
1675    cycles for FrkTons New
8227    cycles for movups
8232    cycles for movupd

Core Solo

Code Select

Genuine Intel(R) CPU           T1350  @ 1.86GHz (SSE3)
2319    cycles for RtlZeroMemory
4072    cycles for FrkTons
2305    cycles for rep stosd
2039    cycles for movdqa
2039    cycles for movaps
2155    cycles for FrkTons New
6098    cycles for movups
6088    cycles for movupd

2315    cycles for RtlZeroMemory
4082    cycles for FrkTons
2296    cycles for rep stosd
2038    cycles for movdqa
2038    cycles for movaps
2164    cycles for FrkTons New
6087    cycles for movups
6095    cycles for movupd

Wooops, the Atom really likes working with many XMM registers at a time.
You are right clive, I didn't post the new build, so here it is. :U

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 04, 2010, 02:57:25 PM

And for readability, here is a version that formats the elapsed
CPU cycles with a thousands separator.
This version tests filling a 16MB buffer.

Code Select


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
11.967.493      cycles for RtlZeroMemory
15.682.875      cycles for FrkTons
10.971.464      cycles for rep stosd
15.418.911      cycles for movdqa
15.435.221      cycles for movaps
15.409.998      cycles for FrkTons New
15.405.469      cycles for movups
15.518.687      cycles for movupd
8.056.812       cycles for MOVNTDQ

11.051.772      cycles for RtlZeroMemory
15.535.943      cycles for FrkTons
10.997.179      cycles for rep stosd
15.467.940      cycles for movdqa
15.457.092      cycles for movaps
15.485.439      cycles for FrkTons New
15.514.719      cycles for movups
15.513.319      cycles for movupd
8.053.411       cycles for MOVNTDQ


--- ok ---

attached the "improved version". :P

In my humble n00b-ist opinion, once we use REP/STOSQ on a native
64-bit OS on x64 machines,
it is going to beat everything else. Not sure about MOVNTDQ, though.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Rockoon on September 04, 2010, 03:11:41 PM

AMD Phenom(tm) II X6 1055T Processor (SSE3)
10.598.549 cycles for RtlZeroMemory
11.260.417 cycles for FrkTons
10.182.613 cycles for rep stosd
10.131.907 cycles for movdqa
10.115.035 cycles for movaps
10.214.832 cycles for FrkTons New
9.915.582 cycles for movups
10.188.273 cycles for movupd
6.888.405 cycles for MOVNTDQ

10.199.509 cycles for RtlZeroMemory
11.050.508 cycles for FrkTons
10.192.022 cycles for rep stosd
10.131.227 cycles for movdqa
10.113.104 cycles for movaps
10.217.227 cycles for FrkTons New
9.952.748 cycles for movups
10.184.520 cycles for movupd
6.700.808 cycles for MOVNTDQ

Title: Re: The fastest way to clear a buffer
Post by: jj2007 on September 04, 2010, 04:51:36 PM

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15.891.933      cycles for RtlZeroMemory
23.350.696      cycles for FrkTons
15.800.872      cycles for rep stosd
23.825.786      cycles for movdqa
23.886.914      cycles for movaps
23.872.424      cycles for FrkTons New
23.760.942      cycles for movups
23.730.159      cycles for movupd
9.818.186       cycles for MOVNTDQ

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 04, 2010, 05:05:25 PM

Quote from: jj2007 on September 04, 2010, 04:51:36 PM
Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15.891.933      cycles for RtlZeroMemory
23.350.696      cycles for FrkTons
15.800.872      cycles for rep stosd
23.825.786      cycles for movdqa
23.886.914      cycles for movaps
23.872.424      cycles for FrkTons New
23.760.942      cycles for movups
23.730.159      cycles for movupd
9.818.186       cycles for MOVNTDQ

Well for big buffers, like Alex said, MOVNTDQ is faster than anything else
on any machine, according to the tests done so far.

The Atom of Clive was really impressive regarding the multiple use of
XMM registers. I hope he'll post his results for this test as well.

Title: Re: The fastest way to clear a buffer
Post by: clive on September 05, 2010, 01:30:38 PM

Ok, this is from the original Acer Aspire One, I should try it on the newer one with the N450 CPU.

From ClearBufferNew3

Code Select

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23855499        cycles for RtlZeroMemory
23614698        cycles for FrkTons
23448786        cycles for rep stosd
23485010        cycles for movdqa
23496468        cycles for movaps
23578524        cycles for FrkTons New
23528033        cycles for movups
23437548        cycles for movupd

9166934 cycles for MOVNTDQ

23551485        cycles for RtlZeroMemory
23568079        cycles for FrkTons
23531873        cycles for rep stosd
23494767        cycles for movdqa
23480850        cycles for movaps
23560303        cycles for FrkTons New
23521236        cycles for movups
23485775        cycles for movupd

9156989 cycles for MOVNTDQ

From ClearBufferNew4

Code Select

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
23.621.570      cycles for RtlZeroMemory
23.548.342      cycles for FrkTons
23.633.944      cycles for rep stosd
23.482.208      cycles for movdqa
23.584.646      cycles for movaps
23.504.980      cycles for FrkTons New
23.561.681      cycles for movups
23.543.870      cycles for movupd
9.184.061       cycles for MOVNTDQ

23.566.672      cycles for RtlZeroMemory
23.507.384      cycles for FrkTons
23.601.428      cycles for rep stosd
23.489.780      cycles for movdqa
23.512.724      cycles for movaps
23.516.591      cycles for FrkTons New
23.549.450      cycles for movups
23.596.780      cycles for movupd
9.156.032       cycles for MOVNTDQ

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 05, 2010, 03:41:53 PM

The Atom is again a surprise :dazzled:
It always has terrific results, for the bad or for the good.

Thanks clive

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 05, 2010, 09:41:24 PM

This is my attempt at an x64 app, made on a machine which runs x64 code very badly :P
Please post results; Frank has been waiting for this for a very long time.

Frank, this is compiled with GoAsm.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 05, 2010, 09:47:46 PM

Hi Alex, thanks for doing this test.
The results on my machine are:

Code Select


---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)	
---------------------------
OK   
---------------------------
---------------------------
REP STOSQ
---------------------------
Clocks: 1.977.888.757

(Press [Ctrl]+[C] for copying to clipboard)	
---------------------------
OK   
---------------------------

What buffer did you use and how many LOOP did you perform?
Maybe 1 million LOOP and 16MB?

Well, these results seem to confirm my suspicion that on a 64-bit machine
REP/STOSQ is the fastest buffer-filling instruction for the time being :U

May I know how did you compile it?
I can download GoASM and try to make some experiments.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 05, 2010, 09:52:30 PM

Quote from: frktons on September 05, 2010, 09:47:46 PM
Hi Alex, thanks for doing this test.
The results on my machine are:

Code Select

---------------------------
MOVNTPD
---------------------------
Clocks: 4.090.321.879

(Press [Ctrl]+[C] for copying to clipboard)
---------------------------
OK
---------------------------

What buffer did you use and how many LOOP did you perform?
Maybe 1 million LOOP and 16MB?

Frank

Frank, you forgot to paste the timings for REP STOSQ; they are in the second message box.

I used a 32MB buffer and 100 loops of the test.

Alex
P.S. What are the timings for STOSQ?

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 05, 2010, 09:57:05 PM

I posted in the previous post Alex. Have a look.

REP/STOSQ looks a lot faster than MOVNTPD
at least on my machine.

Thanks again for doing the test. :clap:

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 05, 2010, 10:04:49 PM

Quote from: frktons on September 05, 2010, 09:57:05 PM
I posted in the previous post Alex. Have a look.

REP/STOSQ looks a lot faster than MOVNTPD
at least on my machine.

Thanks again for doing the test. :clap:

Frank

Yes, this behaviour is not surprising (that STOSQ is faster).

Initially you didn't post the results for REP STOSQ :P
You added them later :)

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 05, 2010, 10:10:33 PM

Quote from: Antariy on September 05, 2010, 10:04:49 PM

Yes, this behaviour is not surprising (that STOSQ is faster).

Initially you didn't post the results for REP STOSQ :P
You added them later :)
Alex

Yes Alex, because a MessageBox appeared, I didn't know that it would display
a second Message, so I posted the first result. :P

On my CPU REP/STOSQ is 4:1 faster than MOVNTPD, and in some tests even more.
It was just an idea of mine that x64 native code and MOVs with the RXX registers are faster than
SSE2 for simple moves of data. And these tests seem to confirm that idea, thanks to you.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 05, 2010, 10:22:46 PM

Frank, I compile this:

Code Select


goasm /x64 asmfilename.asm
golink asmfilename.obj

Nothing more.

Run each tool with /? to see the full help on its parameters.

EDITED: Frank, I forgot to add this:
To link, you need to add the names of the DLLs whose APIs are used to the command line, so:

Code Select


golink asmfilename.obj kernel32.dll user32.dll ... etc

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 05, 2010, 10:24:38 PM

Quote from: Antariy on September 05, 2010, 10:22:46 PM
Frank, I compile this:

Code Select

goasm /x64 asmfilename.asm
golink asmfilename.obj

Nothing more.

Run each tool with /? to see the full help on its parameters.

Alex

Very good, thanks.

As I have some spare time I'll do some experiment on 64 bit code,
I think I'll enjoy it.

Frank

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 18, 2010, 09:20:42 PM

Frank, please test the app attached to this post. I commit the memory before the test, which can (must) give better results in the tests.

Don't forget to post timings :)

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 19, 2010, 04:39:31 PM

The test produces these results on my CPU:

Code Select


Clearing done
183292241 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
921076349 clocks for a 33554432 bytes buffer with using MOVNTDQ

REP/STOSQ is getting faster this way.

With these big numbers a thousand separator would help a lot:

Clearing done
183.292.241 clocks for a 33.554.432 bytes buffer with using REP STOSQ
Clearing done
921.076.349 clocks for a 33.554.432 bytes buffer with using MOVNTDQ

Frank
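
For what it's worth, the dotted format used above is simple to reproduce when post-processing pasted timings. A minimal Python sketch (the helper name is made up for this example):

```python
def with_dots(value: int) -> str:
    """Format an integer with '.' as the thousands separator."""
    return f"{value:,}".replace(",", ".")


print(with_dots(183292241))   # 183.292.241
print(with_dots(33554432))    # 33.554.432
```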

Title: Re: The fastest way to clear a buffer
Post by: Antariy on September 19, 2010, 09:37:37 PM

Oh...

At the last moment I added cpuid to the test, but didn't do everything that it requires... That's what hurry does...

Frank, please test the new one attached to this post. The previous test is NOT right.

Alex

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 19, 2010, 09:41:54 PM

Code Select


Clearing done
23712588 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
20154312 clocks for a 33554432 bytes buffer with using MOVNTDQ

that's it Alex, MOVNTDQ still faster than REP STOSQ

Frank

Title: Re: The fastest way to clear a buffer
Post by: zemtex on September 26, 2010, 06:39:05 PM

I would use a macro to set spaces in a buffer. A macro can be tailored for peak performance and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to say whether the edi register needs to be preserved or not, which would also save instructions.

A macro can let you choose between different methods based on how fast each method is for the different data sizes. One method for size x-y, another method for size v-t etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.

Take advantage of macros :U

Title: Re: The fastest way to clear a buffer
Post by: frktons on September 26, 2010, 08:54:31 PM

Quote from: zemtex on September 26, 2010, 06:39:05 PM
I would use a macro to set spaces in a buffer. A macro can be tailored for peak performance and it would also eliminate the unwanted pushes and pops. You could also pass a boolean parameter to the macro to say whether the edi register needs to be preserved or not, which would also save instructions.

A macro can let you choose between different methods based on how fast each method is for the different data sizes. One method for size x-y, another method for size v-t etc. If the buffer is only one byte you could make the macro much faster, avoiding unnecessary overhead.

Take advantage of macros :U

Feel free to post any working example you like. :U

Frank
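
As a rough illustration of the selection logic such a macro could encode, here is a sketch in Python rather than MASM. Everything in it is an assumption: the function name, the method labels, and above all the thresholds, which are guesses and not measured values. A real macro would take its cut-off points from timings on the target CPU, like the ones posted in this thread.

```python
def pick_fill_method(size: int, cache_bytes: int = 2 * 1024 * 1024) -> str:
    """Pick a fill strategy from the buffer size (illustrative thresholds)."""
    if size < 16:
        return "plain mov stores"      # too small to justify any setup cost
    if size % 16 != 0:
        return "rep stosd + tail"      # SSE loops here assume 16-byte blocks
    if size > cache_bytes:
        return "movntdq loop"          # non-temporal stores for huge buffers
    return "movdqa loop"               # cached, aligned 128-bit stores


print(pick_fill_method(1))                  # plain mov stores
print(pick_fill_method(8000))               # movdqa loop
print(pick_fill_method(16 * 1024 * 1024))   # movntdq loop
```

The `cache_bytes` default is a placeholder; the results earlier in the thread suggest the crossover depends heavily on the machine.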

Title: Re: The fastest way to clear a buffer
Post by: xanatose on October 01, 2010, 01:17:55 AM

On my laptop (MacBook Pro, running Windows 7 64-bit) I get these results

For ClearBufferNew4.exe:

Code Select


Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz (SSE4)
16.270.158      cycles for RtlZeroMemory
22.256.374      cycles for FrkTons
17.003.011      cycles for rep stosd
22.407.395      cycles for movdqa
21.586.957      cycles for movaps
20.894.627      cycles for FrkTons New
21.574.685      cycles for movups
21.463.688      cycles for movupd
8.449.814       cycles for MOVNTDQ

for clearbufx64_3.exe:

Code Select


Clearing done
65323406 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
33619797 clocks for a 33554432 bytes buffer with using MOVNTDQ

I guess what is faster will depend on the machine.
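
One way to put a number on "depends on the machine" is the ratio of REP STOSQ clocks to MOVNTDQ clocks for the same 32MB buffer. The Python sketch below uses only the clearbufx64_3 figures posted in this thread (Frank's corrected run and the P8700 run above); the dictionary labels and function name are mine.

```python
BUFFER_BYTES = 33554432   # buffer size printed by the test

results = {
    "frktons": {"REP STOSQ": 23712588, "MOVNTDQ": 20154312},
    "xanatose (P8700)": {"REP STOSQ": 65323406, "MOVNTDQ": 33619797},
}


def stosq_vs_movntdq(timings: dict) -> float:
    """How many times more clocks REP STOSQ needed than MOVNTDQ."""
    return timings["REP STOSQ"] / timings["MOVNTDQ"]


for machine, timings in results.items():
    ratio = stosq_vs_movntdq(timings)
    per_byte = timings["MOVNTDQ"] / BUFFER_BYTES
    print(f"{machine}: STOSQ/MOVNTDQ = {ratio:.2f}, "
          f"MOVNTDQ = {per_byte:.2f} clocks/byte")
```

On the first machine the two methods are within about 20% of each other; on the second REP STOSQ takes nearly twice as long.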

Title: Re: The fastest way to clear a buffer
Post by: Antariy on October 06, 2010, 10:20:51 PM

Quote from: xanatose on October 01, 2010, 01:17:55 AM
On my laptop (MacBook Pro, running Windows 7 64-bit) I get these results
.........
I guess what is faster will depend on the machine.

Hi!

Thanks for testing!

Note that ClearBufferNew4.exe is a 32-bit app and uses 32-bit REP STOSD (whose writes are probably cached so that they go out to the system bus in efficient transactions), while clearbufx64_3.exe is a 64-bit app and uses 64-bit REP STOSQ, which is probably not cached while the write is in progress: its timings are very close to those of the SSE2 algo, which writes 128 bits at a time, non-cached.

Alex

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: frktons on August 24, 2010, 08:47:34 PM