Translating from 32 bit to 64 bit

Started by frktons, August 25, 2010, 08:00:29 PM

frktons

The state of the art, so to speak, is the following:

A working example that still needs to be timed with RDTSC.

It is x64, so you can assemble and run it only on a 64-bit machine/OS.
Attached are the source so far and the assembled example.

See you in a few days.

Frank
Mind is like a parachute. You know what to do in order to use it :-)

Antariy

Hi!

Frank, this is my attempt at making an x64 app with ML64.
It is only my third x64 app, so don't be too hard on me :P

I don't have MSVC10 or the other tools needed to build x64 apps with the MS toolchain. So if this source does not assemble, try changing the API import names.


The SIZE_OF_BUFFER macro at the start of the file sets the size of the test buffer.
I suggest setting it to 32 MB (or any size much bigger than the L2 cache).


I also added SSE2 code, as in this test: http://www.masm32.com/board/index.php?topic=14685.msg120025#msg120025



Please test this and tell me whether it works. I cannot run or debug it myself...



Alex
P.S. Can somebody post any x64 Windows DLL? GDI32.DLL is usually a small, basic DLL on NT systems.

GregL

Frank,

The main procedure requires sub rsp, 40 at the beginning of it.
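
As background on where the 40 comes from, here is a minimal sketch in C (not MASM; the helper name frame_adjust is hypothetical). It assumes the usual Windows x64 calling convention: the caller reserves 32 bytes of shadow space, RSP must be 16-byte aligned at every CALL, and on entry to a procedure RSP is 8 bytes off alignment because of the pushed return address.

```c
/* Assumed Windows x64 convention (background, not taken from the thread). */
enum { SHADOW_SPACE = 32, RETURN_ADDRESS = 8, STACK_ALIGN = 16 };

/* Hypothetical helper: smallest value for "sub rsp, N" that reserves the
   shadow space plus `locals` bytes and leaves RSP 16-byte aligned for the
   next CALL. */
unsigned frame_adjust(unsigned locals)
{
    unsigned need = SHADOW_SPACE + locals;
    /* Return address + reserved bytes must add up to a multiple of 16. */
    unsigned misalign = (RETURN_ADDRESS + need) % STACK_ALIGN;
    if (misalign != 0)
        need += STACK_ALIGN - misalign;
    return need;
}
```

With no locals, frame_adjust(0) gives 40: 32 bytes of shadow space plus 8 bytes of padding to restore alignment.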






GregL

Frank,

I also corrected my explanation of sub rsp, 40 in my above post.

Antariy

Hi!

Can somebody post results for my code "clearbufx64_2.zip" from two posts above?


Don't think I am asking without reason: I cannot test the code I write because I am on a 32-bit system.
So it would be nice if somebody could test it and post the results. You can play with the SIZE_OF_BUFFER macro at the start of the file. It needs to be much bigger than the L2 cache of the CPU under test. For example, 32 MB should be fine for most current CPUs :)

Thanks.



Alex

GregL

Alex,

I got a minute to try out your code. It runs and looks like this

Clearing done
6576 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
13296 clocks for a 8000 bytes buffer with using MOVNTDQ

REP STOSQ


with the cursor right after REP STOSQ. If I press a key it then exits.

BTW, CPU is Pentium D 940 (3.2 GHz).

Antariy

Quote from: GregL on September 11, 2010, 08:31:15 PM
Alex,

I got a minute to try out your code. It runs and looks like this

Thanks, Greg!

I forgot to terminate msg1, and that is the reason for the strange output :)
I replaced CRLF with CRLF,CRLF and forgot the null.

    msg1 BYTE "%u clocks for a %u bytes buffer with using %s",13,10,13,10


I have now fixed the code (i.e. the data): the message is terminated.
The fixed version is attached to this post.

I also changed the test buffer size to 32 MB; this should give different (better) results for REP STOSQ on a 64-bit machine.

If anybody has a little free time and an x64 machine, please test this.
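
To show why a missing terminator produces the trailing "REP STOSQ" in Greg's output, here is a minimal sketch in C (not MASM). The data layout is an assumption made for illustration: msg1 is followed directly in memory by the "REP STOSQ" name string, so the format string only ends at that string's terminator. The names data_section and format_msg1 are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Simulated data section (assumed layout): msg1 has no terminating 0, and
   the next item in memory is the "REP STOSQ" name string. */
static const char data_section[] =
    "%u clocks for a %u bytes buffer with using %s\r\n\r\n" /* msg1, unterminated */
    "REP STOSQ";                /* next string; the array's final 0 ends it */

/* Formats msg1 into `out` the way printf would print it, and returns the
   length reported by snprintf. */
int format_msg1(char *out, size_t out_size,
                unsigned clocks, unsigned bytes, const char *method)
{
    /* The format runs on past msg1 until the first 0 byte. */
    return snprintf(out, out_size, data_section, clocks, bytes, method);
}
```

Adding a 0 after the second 13,10 pair ends the format string where the message ends, and the stray text disappears.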

Thanks.



Alex

sinsi

Q6600 quad at 2.4GHz

Clearing done
45371592 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
19012428 clocks for a 33554432 bytes buffer with using MOVNTDQ


The old program was all over the place; you might need to use SetProcessAffinityMask to lock it to one CPU

Clearing done
3456 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
6345 clocks for a 8000 bytes buffer with using MOVNTDQ

REP STOSQClearing done
2601 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
7047 clocks for a 8000 bytes buffer with using MOVNTDQ

REP STOSQClearing done
1908 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
9936 clocks for a 8000 bytes buffer with using MOVNTDQ

REP STOSQClearing done
2898 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
6858 clocks for a 8000 bytes buffer with using MOVNTDQ

REP STOSQClearing done
2313 clocks for a 8000 bytes buffer with using REP STOSQ

REP STOSQClearing done
7731 clocks for a 8000 bytes buffer with using MOVNTDQ

Light travels faster than sound, that's why some people seem bright until you hear them.

frktons

On my Core 2 Duo 2.6 GHz I get results similar to Sinsi's:


Clearing done
54803034 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
18934740 clocks for a 33554432 bytes buffer with using MOVNTDQ


And again:


Clearing done
48940101 clocks for a 33554432 bytes buffer with using REP STOSQ

Clearing done
17176869 clocks for a 33554432 bytes buffer with using MOVNTDQ


I didn't know that MOVNTDQ is still much faster than REP STOSQ for big buffers,
even in 64-bit code.
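
To put the raw clock counts on a common scale, here is a small sketch in C (not MASM) that turns a measurement into clocks per byte and a speedup ratio. The helper names are hypothetical; the figures in the note below are the ones posted in this thread.

```c
/* Clocks spent per byte for one measurement. */
double clocks_per_byte(unsigned long long clocks, unsigned long long bytes)
{
    return (double)clocks / (double)bytes;
}

/* How many times fewer clocks the fast measurement took than the slow one. */
double speedup(unsigned long long slow_clocks, unsigned long long fast_clocks)
{
    return (double)slow_clocks / (double)fast_clocks;
}
```

For the 33554432-byte buffer, Sinsi's run (45371592 vs 19012428 clocks) works out to roughly 1.35 vs 0.57 clocks per byte, a speedup of about 2.4 for MOVNTDQ; the first Core 2 Duo run (54803034 vs 18934740 clocks) gives about 2.9.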

Thanks, Alex, for completing this short program during my computerless/internetless days.  :U
And thanks to Greg and the guys who gave me inspiration and examples to test it.  :clap:


Frank.


Antariy

Hi!

Thanks for the suggestion, Sinsi!

I set the affinity in the test attached to this post.
Initially I skipped it out of laziness: I don't have any import libraries for x64, so I must build them by hand, and the fewer APIs I import, the fewer import libs I have to make :)

Frank, only Intel knows how its REP STOSQ is implemented. If it doesn't use non-temporal writes to memory, it cannot beat SSE, because with big buffers the cache gets heavily polluted.
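
For readers who want to see the non-temporal approach outside of assembly, here is a minimal sketch in C using the SSE2 intrinsic _mm_stream_si128, which compiles to MOVNTDQ. It is not the code from the attachment. The function name clear_nontemporal is hypothetical, and it assumes a 16-byte-aligned buffer whose size is a multiple of 16; otherwise, or when SSE2 is not available, it falls back to memset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_stream_si128, _mm_sfence */
#define HAVE_SSE2 1
#endif

/* Clears `size` bytes at `buf` to zero.
   Fast path: 16-byte non-temporal stores (MOVNTDQ).
   Fallback: plain memset for unaligned input or builds without SSE2. */
void clear_nontemporal(void *buf, size_t size)
{
#ifdef HAVE_SSE2
    if ((uintptr_t)buf % 16 == 0 && size % 16 == 0) {
        __m128i zero = _mm_setzero_si128();
        __m128i *p = (__m128i *)buf;
        for (size_t i = 0; i < size / 16; i++)
            _mm_stream_si128(p + i, zero); /* non-temporal hint: avoid filling the cache */
        _mm_sfence(); /* order the streamed stores before later stores */
        return;
    }
#endif
    memset(buf, 0, size);
}
```

The point of the non-temporal hint is that a 32 MB clear does not evict everything else from the cache on its way to memory, which matches the explanation above for the large-buffer results.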


Sinsi, does GetCurrentProcess return -1 under x64, as it does under 32-bit NT?



Alex