The fastest way to clear a buffer

Started by frktons, August 24, 2010, 08:47:34 PM

frktons

Hi all.
What is the fastest possible way to clear a buffer?
I have an area of 8,000 bytes that I want to fill with spaces after having
used it for other purposes.
I came up with this solution, and I'm wondering whether there are better and
faster ways to do it.


;----------------------------------------------------------------------
; Fast way for clearing [putting all spaces into] a
; structure CHAR_INFO totalling 8000 bytes.
;----------------------------------------------------------------------
; Author: frktons @ MASM32 forum
; Date: 24/aug/2010.
;----------------------------------------------------------------------


include \masm32\include\masm32rt.inc


ClearBuffer PROTO :DWORD


;----------------------------------------------------------------------


.data?

    buf2clear CHAR_INFO 2000 dup (<>)
    rHnd      HANDLE ?

    howmany   dd ?
    buffer    INPUT_RECORD <>   
   

.code

start:

Main PROC

    INVOKE GetStdHandle, STD_INPUT_HANDLE
    mov rHnd,eax

    INVOKE ClearBuffer, ADDR buf2clear
   
    print "Clearing done",13,10,13,10
    print "Press any key to close...",13,10
   
    CALL AnyKey

finish: INVOKE ExitProcess,0

    ret

Main ENDP

; -------------------------------------------------------------------------   

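ClearBuffer PROC USES ebx AddrBuffer:DWORD   ; EBX is callee-saved under stdcall

    mov eax, AddrBuffer     ; eax -> start of the buffer
    mov ecx, 1000           ; 1000 iterations x 8 bytes = 8000 bytes
    mov bl, 32              ; bx = 2020h (two spaces)
    mov bh, bl
    bswap   ebx             ; move those two spaces into the upper word
    mov bl, 32              ; refill the lower word
    mov bh, bl              ; ebx = 20202020h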

cycle:

    mov [eax], ebx
    add eax, 4
    mov [eax], ebx
    add eax, 4
    dec ecx
    jnz cycle
       

    ret

ClearBuffer ENDP

; -------------------------------------------------------------------------
;Returns: key code in buffer.KeyEvent.wVirtualKeyCode WORD size
; -------------------------------------------------------------------------

AnyKey PROC

again:

    INVOKE ReadConsoleInput,rHnd,offset buffer,1,offset howmany
    cmp buffer.EventType,KEY_EVENT
    jnz again

    cmp buffer.KeyEvent.bKeyDown,0
    jz again

    ret

AnyKey ENDP

; -------------------------------------------------------------------------

end start


Any improvement possible?

Thanks
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that

Magnum


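include \masm32\include\masm32rt.inc

.DATA

    ValueOK     db  "Memory zeroed out.",0
    Sample      db  "BOX",0
    Storage     db  "Co-ordinates of the Ark of the Covenant are...",0

.data?

    Storage1     db  256 dup(?)

.CODE

start:

    ; overwrite the whole Storage string with zero bytes
    invoke  RtlZeroMemory, ADDR Storage, sizeof Storage ; prototype is in kernel32.inc
    invoke  ExitProcess, 0

end start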

Have a great day,
                         Andy

jj2007

Well, he wants spaces, not zeroes, but a rep stosd is most probably the fastest way to fill an 8k buffer with spaces.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
1231    cycles for rep stosd.

Antariy

Hi, Frank!

What if you change the code to this:

ClearBuffer PROC USES ebx AddrBuffer:DWORD   ; EBX is callee-saved under stdcall

    mov eax, AddrBuffer
    mov ecx, 1000
    mov ebx,20202020h ; load ebx with a single instruction
cycle:

    mov [eax], ebx
    mov [eax+4], ebx
    add eax, 8
    dec ecx
    jnz cycle
       

    ret

ClearBuffer ENDP


Does this work?

Or this:


ClearBuffer PROC AddrBuffer:DWORD
    mov edx,edi
    mov ecx, 2000           ; must be 2000 DWORDs (8000 bytes). Thanks to Jochen!
    mov edi, AddrBuffer
    mov eax,20202020h
    rep stosd
    mov edi,edx
    ret

ClearBuffer ENDP


Test this.



Alex

EDITED. Jochen tried to confuse me with a bug. So, Jochen, this is not my code, and it is hard to concentrate when posting online in real time. I only made a suggestion; Frank can find the solution himself. I only wanted to show the way: there is no need to fill ebx piece by piece when shorter code can do it.

jj2007

Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

Antariy

Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa



Jochen, don't confuse Frank with your experience :) Everyone knows that you like SSE2 very much. What about movaps?



Alex

frktons

Quote from: dedndave on August 24, 2010, 08:55:47 PM
just the other day, we were playing with RtlZeroMemory in one of the other threads
you might try that

Hi Dave.
The name RtlZeroMemory suggests that this function clears an area of memory to zero.
It can be useful in other situations, but here I need to fill with spaces [ASCII 32].

Quote from: Antariy on August 24, 2010, 09:56:01 PM
Hi, Frank!

What if you change the code to this:

ClearBuffer PROC USES ebx AddrBuffer:DWORD   ; EBX is callee-saved under stdcall

    mov eax, AddrBuffer
    mov ecx, 1000
    mov ebx,20202020h ; load ebx with a single instruction
cycle:

    mov [eax], ebx
    mov [eax+4], ebx
    add eax, 8
    dec ecx
    jnz cycle
       

    ret

ClearBuffer ENDP


Does this work?

Or this:


ClearBuffer PROC AddrBuffer:DWORD
    mov edx,edi
    mov ecx, 2000           ; must be 2000 DWORDs (8000 bytes). Thanks to Jochen!
    mov edi, AddrBuffer
    mov eax,20202020h
    rep stosd
    mov edi,edx
    ret

ClearBuffer ENDP


Test this.



Alex

EDITED. Jochen tried to confuse me with a bug. So, Jochen, this is not my code, and it is hard to concentrate when posting online in real time. I only made a suggestion; Frank can find the solution himself. I only wanted to show the way: there is no need to fill ebx piece by piece when shorter code can do it.


Thanks Alex. The first solution should gain some cycles compared to mine,
and the second one, using stosd, should be faster according to your comments.
I have to test it and understand how stosd works; it is the first time
I've seen it  :P

Quote from: jj2007 on August 24, 2010, 09:58:45 PM
Or, if you are not scared of SSE2:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa


Hi Jochen, if you post the code I can have a look at it.
I'm not scared of SSE2/3/4, but I don't know them, so this could
be an occasion to put the Intel manuals to work a little.  :lol

And last but not least, how does my version perform compared to:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1261    cycles for RtlZeroMemory
1233    cycles for rep stosd
1014    cycles for movdqa

How much faster are these methods compared to the first one I posted?

jj2007

Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
2024    cycles for FrkTons
1233    cycles for rep stosd
1014    cycles for movdqa
1013    cycles for movaps

frktons

Quote from: jj2007 on August 24, 2010, 11:12:56 PM
Quote from: frktons on August 24, 2010, 11:00:51 PM
How much faster are these methods compared to the first one I posted?

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1252    cycles for RtlZeroMemory
2024    cycles for FrkTons
1233    cycles for rep stosd
1014    cycles for movdqa
1013    cycles for movaps


Thanks Jochen,
now I've an idea of the performance gap among the various methods.
Time to study them a little, tomorrow and the days to come.  :U

On my PC I get these results:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
1058    cycles for RtlZeroMemory
2022    cycles for FrkTons
1056    cycles for rep stosd
532     cycles for movdqa
531     cycles for movaps

1056    cycles for RtlZeroMemory
2318    cycles for FrkTons
1224    cycles for rep stosd
616     cycles for movdqa
613     cycles for movaps


--- ok ---


Interestingly, RtlZeroMemory, probably a C/C++ function, is about
twice as fast as the elementary hand-written assembly version I coded.  :P

ecube

I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

jj2007

Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

Last post of that thread:
Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster

1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).
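
The SSE2 code behind these timings is not shown in the thread, so the following is only a minimal sketch of what a movdqa fill could look like, not the routine that was actually measured. The name ClearBufferSSE2 and the four-stores-per-iteration unrolling are assumptions. The sketch also assumes that the buffer is declared right after an align 16 directive, that .686 and .xmm follow the masm32rt.inc include, and that the assembler accepts SSE2 mnemonics.

```asm
; Sketch only: assumes AddrBuffer is 16-byte aligned and the size is 8000 bytes.
; Needs .686 / .xmm after the masm32rt.inc include, and in .data?:
;     align 16
;     buf2clear CHAR_INFO 2000 dup (<>)

ClearBufferSSE2 PROC AddrBuffer:DWORD      ; hypothetical name

    mov     eax, 20202020h       ; four spaces
    movd    xmm0, eax            ; low dword of xmm0 = 20202020h
    pshufd  xmm0, xmm0, 0        ; copy it to all four dwords: 16 spaces
    mov     eax, AddrBuffer
    mov     ecx, 8000/64         ; 125 iterations of 64 bytes each

@@:
    movdqa  [eax], xmm0          ; aligned 16-byte stores; they fault if unaligned
    movdqa  [eax+16], xmm0
    movdqa  [eax+32], xmm0
    movdqa  [eax+48], xmm0
    add     eax, 64
    dec     ecx
    jnz     @B

    ret

ClearBufferSSE2 ENDP
```

Because 64 is a multiple of 16, every store stays aligned as long as the buffer start is. Swapping movdqa for movaps gives the movaps variant from the timings; movdqu would accept an unaligned buffer instead of faulting.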

frktons

Quote from: jj2007 on August 25, 2010, 06:30:46 AM
Quote from: E^cube on August 25, 2010, 12:08:08 AM
I really hate seeing dupe threads, but whatever, I don't run this forum. We did speed testing for RtlZeroMemory a long time ago here: http://www.masm32.com/board/index.php?topic=6576.0 The overall result was that RtlZeroMemory is extremely fast for buffers of 512 bytes and up; I don't think anything really beat it. For buffers under 512 bytes, however, the fastest was based on the memfill procedure in MASM32, I believe.

Last post of that thread:
Quote from: hutch-- on December 05, 2009, 11:14:25 AM
REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster

1. Anybody who is able to use Olly will quickly find out that RtlZeroMemory is just a wrapper for rep stosd
2. We had not beaten to death the option of doing it in SSE2 - that's why I added it. With success :green (but attention - this requires 16-byte alignment of the buffer).

I have a question about RtlZeroMemory: can we call it in some way so that it
fills the buffer with a character of our choice, or does it only zero the area,
with no parameter for the fill value?

By the way, the SSE2 solution you posted looks much faster than it, so why not use it on modern
machines?  :P

Thanks

hutch--

Frank,

Have a play with REP STOSD; apart from SSE you will struggle to do much better.
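
As a starting point for that experiment, here is a minimal sketch of a REP STOSD fill that takes the fill character as a parameter, which is one way to get a "RtlZeroMemory with a character of your choice". FillBuffer and its parameter names are hypothetical, not part of the MASM32 library, and the routine has not been timed against the versions above.

```asm
FillBuffer PROTO :DWORD, :DWORD, :DWORD    ; hypothetical helper

FillBuffer PROC USES edi AddrBuffer:DWORD, ByteCount:DWORD, FillChar:DWORD

    movzx   eax, byte ptr FillChar   ; keep only the low byte of the argument
    imul    eax, eax, 01010101h      ; replicate it into all four bytes of eax
    mov     edi, AddrBuffer          ; stosd/stosb write to [edi]
    mov     ecx, ByteCount
    mov     edx, ecx                 ; remember the full byte count
    shr     ecx, 2                   ; number of whole DWORDs
    cld                              ; make sure edi counts upwards
    rep     stosd                    ; store eax ecx times, 4 bytes per step
    mov     ecx, edx
    and     ecx, 3                   ; 0 to 3 leftover bytes
    rep     stosb                    ; store al for the tail

    ret

FillBuffer ENDP
```

For the buffer in this thread the call would be INVOKE FillBuffer, ADDR buf2clear, SIZEOF buf2clear, 32. SIZEOF buf2clear is 8000, so the stosb tail does nothing here; it only matters for sizes that are not a multiple of 4.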