Hi all.

rep stosd was long considered the fastest way to initialize a block of
memory in 32-bit assembly.
Now SSE instructions beat it, on Intel machines at least.
In my opinion, on 64-bit machines working with native 64-bit operations,
we could get better results than the SSE instructions just by using the 64-bit general-purpose registers.

Could anyone translate the following code into a 64 bit code
and post the performance it has versus the 32 bit version?

Any help is welcome


include \masm32\include\

ClearBuffer PROTO :DWORD



.data

    buf2clear CHAR_INFO 2000 dup (<>)

.code

start:

    INVOKE ClearBuffer, ADDR buf2clear
    print "Clearing done",13,10,13,10

finish: INVOKE ExitProcess,0



ClearBuffer PROC AddrBuffer:DWORD

    push edi

    mov ecx, 2000 

    mov edi, AddrBuffer

    mov eax,20202020h

    rep stosd

    pop edi

    ret


ClearBuffer ENDP

; -------------------------------------------------------------------------

end start

Is there nobody here using JWASM, GoASM or any other 64-bit assembler who can change
a couple of instructions and post the source and exe to test on a 64-bit OS?

Or anybody who can tell me which JWASM command-line options to use
to assemble for a 64-bit OS?

Would an ml64 version be of use to you?

I don't know whether I can use ML64, or how to use it.
What about include \masm32\include\ ?
Is there anything to change in the source other than

    push edi

    mov ecx, 2000

    mov edi, AddrBuffer

    mov eax,20202020h

    rep stosd

    pop edi

That should translate to something like:

    push rdi

    mov rcx, 1000

    mov rdi, AddrBuffer

    mov rax,2020202020202020h

    rep stosq

    pop rdi


And what parameters do I have to pass to ML64.EXE to assemble
native 64-bit code?
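
For reference, a minimal sketch of how the routine might look as a complete ml64 source file. It assumes the Microsoft x64 calling convention, where the first argument (the buffer address) arrives in RCX, and it avoids PROC parameters and INVOKE. It omits the unwind (SEH) data that a function pushing a nonvolatile register would normally declare. The assemble and link lines in the header comment follow the pattern of the JWasm sample later in this thread; the file name is hypothetical.

```asm
;--- Sketch only: Win64 version of ClearBuffer.
;--- assemble: JWasm -win64 ClearBuf64.asm     (file name is hypothetical)
;---       or: ml64 -c ClearBuf64.asm
;--- Assumption: buffer address is passed in RCX (Microsoft x64 convention).

option casemap:none

.code

ClearBuffer PROC
    push rdi                        ; RDI is nonvolatile in Win64, so preserve it
    mov  rdi, rcx                   ; destination = first argument
    mov  ecx, 1000                  ; 2000 CHAR_INFO * 4 bytes = 8000 bytes = 1000 qwords
                                    ; (writing ECX zero-extends into RCX)
    mov  rax, 2020202020202020h     ; eight space characters
    rep  stosq                      ; store RAX at [RDI], RCX times
    pop  rdi
    ret
ClearBuffer ENDP

end
```

A caller would load the buffer address into RCX (for example with lea rcx, buf2clear) and then call ClearBuffer.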


Sorry, I just realized my Visual Studio 2010 Professional trial has expired and the Express Edition does not build 64-bit without some substantial modifications.  So, I'm not set up to build 64-bit with ml64. Using include \masm32\include\ won't work because it is for 32-bit. Your 64-bit translation of the core code looks good. I'm going to work on getting my 2010 Express Edition building 64-bit. It should be possible, I was able to do it with the 2008 Express Edition.

Maybe someone else could build it with GoASM or JWASM in the meantime.


If anyone is interested I got VC++ 2010 Express Edition building x64 C programs by installing the latest Windows SDK and setting the VC++ LIB directory to the one from the SDK for x64. That's it, it works.

Now to get it to build x64 MASM programs. Trouble is, I can't find ml64.exe anywhere, and I uninstalled the VC++ Pro trial.


I've never had any C stuff, only SDKs, but found ML64 here:

C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64

Well, I've got ML64 in the same directory, sinsi. The trouble is how to translate
a simple routine like the one I posted into 64-bit, assemble it, and measure the performance
using some kind of CPU timing, all things beyond my current knowledge.

The timing is simple, just wrap ClearBuffer in a couple of 'rdtsc' for an easy test, the problem is the lack of a print/inkey macro for 64-bit.
Quote from: sinsi on September 05, 2010, 04:48:13 AM
The timing is simple, just wrap ClearBuffer in a couple of 'rdtsc' for an easy test, the problem is the lack of a print/inkey macro for 64-bit.

A workaround may be to use MSVCRT's printf and kbhit.

Here's a sample:

;--- Win64 "hello world" console application.
;--- uses CRT functions.
;--- assemble: JWasm -win64 Win64_6.asm
;---       or: ml64 -c Win64_6.asm
;--- link:     Link /subsystem:console Win64_6.obj

option casemap:none

includelib msvcrt.lib

externdef printf : near
externdef kbhit : near


.data

string   db 10,"hello, world.",10,0

.code

main proc
sub rsp, 28h        ; space for 4 arguments + 16byte aligned stack
mov rcx, offset string
call printf
call kbhit
xor eax, eax
add rsp,28h
ret
main endp

end


You'll need a 64-bit version of msvcrt.lib.
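
Combining the two suggestions above (a pair of rdtsc reads around the fill, and printf for output) might look like the sketch below. It is an illustration, not a calibrated benchmark: rdtsc is not a serializing instruction, a single run is noisy, and the first pass also pays for cache misses. The buffer is declared here as 1000 qwords (8000 bytes) because the masm32 CHAR_INFO definition is not available in 64-bit; that declaration and the format string are assumptions.

```asm
;--- Sketch only: time one rep stosq fill with rdtsc, print the tick count.
;--- assemble and link the same way as the sample above.

option casemap:none

includelib msvcrt.lib

externdef printf : near

.data

fmt       db "rep stosq: %I64u ticks",10,0
align 16
buf2clear dq 1000 dup (?)       ; assumption: 8000 bytes, same size as 2000 CHAR_INFO

.code

main proc
    push rdi                    ; RDI and RBX are nonvolatile
    push rbx
    sub  rsp, 28h               ; shadow space; two pushes + 28h keep RSP 16-byte aligned

    rdtsc                       ; EDX:EAX = time-stamp counter
    shl  rdx, 32
    or   rax, rdx
    mov  rbx, rax               ; start ticks

    lea  rdi, buf2clear
    mov  ecx, 1000              ; qword count
    mov  rax, 2020202020202020h
    rep  stosq

    rdtsc
    shl  rdx, 32
    or   rax, rdx
    sub  rax, rbx               ; elapsed ticks

    lea  rcx, fmt               ; first argument: format string
    mov  rdx, rax               ; second argument: tick count
    call printf

    xor  eax, eax
    add  rsp, 28h
    pop  rbx
    pop  rdi
    ret
main endp

end
```

For a fairer comparison with the 32-bit version, the fill would normally be repeated many times and the lowest or average count taken.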



Thanks japheth.

I was thinking about using your JWASM for this task, but I lack the necessary
expertise. Anyway, I'll try to convert the routine myself, add a couple
of rdtsc instructions as sinsi suggested, and see if I can get it working.

Next week I hope to find the time to undertake the task.

Thanks japheth. I noticed you aligned the stack, does calling the C dll need alignment too?
I usually align it at the start of a proc even if I don't use an API, mainly so I know that RSP will be 8-aligned if I call another proc.

What I am trying to get at is, the Windows API requires 16-byte alignment but does anything else? Or is it just a good habit to get into?
It doesn't really cost too much to do, that's why I get into the habit even if it's unneeded.

>Win64 SEH.
yuck *shudder*
It really isn't a performance problem. As you see above, including the extra space to maintain alignment was free, and it's even cheaper than free if more calls that also have 5 or fewer parameters are made (rsp was already set to ensure alignment, so no need to mess with it again).

Also note that you do not have to maintain alignment if you make no calls, so leaf functions don't need to worry about it and can use your more typical push and pop mechanics for temporarily saving registers.
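
To make the distinction concrete, here is a small sketch of both cases. The procedure names, the message, and the argument layout of the second routine are hypothetical. One caveat: the x64 unwind rules expect prolog unwind data for any function that pushes a nonvolatile register, and this sketch leaves that out.

```asm
option casemap:none

includelib msvcrt.lib

externdef printf : near

.data

msg     db "done",10,0

.code

; Non-leaf: it calls printf, so RSP must be 16-byte aligned at each call.
; On entry RSP is 8 (mod 16) because the caller's call pushed a return address.
Report proc                     ; hypothetical name
    sub  rsp, 28h               ; 20h shadow space + 8 bytes to restore alignment
    lea  rcx, msg
    call printf
    lea  rcx, msg
    call printf                 ; same frame reused, no further RSP adjustment
    add  rsp, 28h
    ret
Report endp

; No calls: alignment does not matter here, so push/pop can save a register.
; Hypothetical arguments: RCX = pointer to bytes, EDX = byte count.
SumBytes proc
    push rbx
    xor  eax, eax               ; running sum
    test edx, edx
    jz   sum_done
sum_loop:
    movzx ebx, byte ptr [rcx]
    add  eax, ebx
    inc  rcx
    dec  edx
    jnz  sum_loop
sum_done:
    pop  rbx
    ret
SumBytes endp

end
```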
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.