Hello,
I want to know which macro is quicker: the one that uses EAX as a buffer, or the one that pushes and pops to/from the stack.
Thanks in advance.
i believe a mov is always faster, but some will dispute that.
the only way to know for sure is to time your code.
Quote from: Kernel_Gaddafi on June 27, 2008, 04:13:46 PM
i believe a mov is always faster
That seems intuitive, I'll agree. However, I seem to recall some testing that happened a while ago (here somewhere, try searching) that found m2m was actually faster.
Caught quite a few people by surprise :bg
Cheers,
Zooba :U
Noob,
The usage depends on what you are doing. In the middle of a pile of messy API code, "m2m" is easily fast enough, but the other macro that uses a register is usually faster, so if the code you are writing is closer to the bare mnemonic end, it's probably a better choice.
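For reference, the two macros under discussion look roughly like this. This is a sketch from memory of how MASM32's macros.asm defines them, so treat the exact parameter names as an assumption and check your own copy:

```masm
; m2m: memory-to-memory copy through the stack, no register is modified
m2m MACRO M1, M2
    push M2
    pop  M1
ENDM

; mrm: memory-to-memory copy through eax, which is overwritten
mrm MACRO m1, m2
    mov eax, m2
    mov m1, eax
ENDM
```

Both copy one DWORD from the second operand to the first; the difference is whether the value travels through the stack or through eax.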
Running on my P3, I cannot find any circumstances where m2m is faster.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; -------------------------------------------------------
; This is an assembly-time random number generator based
; on code by George Marsaglia:
; #define znew ((z=36969*(z&65535)+(z>>16))<<16)
; #define wnew ((w=18000*(w&65535)+(w>>16))&65535)
; #define MWC (znew+wnew)
; -------------------------------------------------------
@znew_seed@ = 362436069
@wnew_seed@ = 521288629
@rnd MACRO base:REQ
LOCAL znew, wnew
@znew_seed@ = 36969 * (@znew_seed@ AND 65535) + (@znew_seed@ SHR 16)
znew = @znew_seed@ SHL 16
@wnew_seed@ = 18000 * (@wnew_seed@ AND 65535) + (@wnew_seed@ SHR 16)
wnew = @wnew_seed@ AND 65535
EXITM <(znew + wnew) MOD base>
ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
FOR mem,<m0,m1,m2,m3,m4,m5,m6,m7,m8,m9,ma,mb,mc,md,me,mf>
mem dd 0
ENDM
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov esi, alloc(1000000*4)
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
m2m m0, m1
m2m m2, m3
m2m m4, m5
m2m m6, m7
m2m m8, m9
m2m ma, mb
m2m mc, md
m2m me, mf
counter_end
print ustr$(eax)," cycles, m2m sequential direct",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov eax, m1
mov m0, eax
mov eax, m3
mov m2, eax
mov eax, m5
mov m4, eax
mov eax, m7
mov m6, eax
mov eax, m9
mov m8, eax
mov eax, mb
mov ma, eax
mov eax, md
mov mc, eax
mov eax, mf
mov me, eax
counter_end
print ustr$(eax)," cycles, mrm sequential direct",13,10,13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
m2m [esi+0], [esi+4]
m2m [esi+8], [esi+12]
m2m [esi+16], [esi+20]
m2m [esi+24], [esi+28]
m2m [esi+32], [esi+36]
m2m [esi+40], [esi+44]
m2m [esi+48], [esi+52]
m2m [esi+56], [esi+60]
counter_end
print ustr$(eax)," cycles, m2m sequential indirect",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
mov eax, [esi+4]
mov [esi+0], eax
mov eax, [esi+12]
mov [esi+8], eax
mov eax, [esi+20]
mov [esi+16], eax
mov eax, [esi+28]
mov [esi+24], eax
mov eax, [esi+36]
mov [esi+32], eax
mov eax, [esi+44]
mov [esi+40], eax
mov eax, [esi+52]
mov [esi+48], eax
mov eax, [esi+60]
mov [esi+56], eax
counter_end
print ustr$(eax)," cycles, mrm sequential indirect",13,10,13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
m2m [esi+@rnd(250000)*4],[esi+@rnd(250000)*4]
ENDM
counter_end
print ustr$(eax)," cycles, m2m random indirect",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov eax, [esi+@rnd(250000)*4]
mov [esi+@rnd(250000)*4], eax
ENDM
counter_end
print ustr$(eax)," cycles, mrm random indirect",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
28 cycles, m2m sequential direct
6 cycles, mrm sequential direct
28 cycles, m2m sequential indirect
10 cycles, mrm sequential indirect
55 cycles, m2m random indirect
36 cycles, mrm random indirect
Quote from: MichaelW on June 28, 2008, 08:13:06 AM
Running on my P3, I cannot find any circumstances where m2m is faster.
Guess I was imagining things then :bg . Oh well. :U
On a Celeron. Looks strange...
94 cycles, m2m sequential direct
1 cycles, mrm sequential direct
84 cycles, m2m sequential indirect
2 cycles, mrm sequential indirect
260 cycles, m2m random indirect
172 cycles, mrm random indirect
What are the cycle counts and esi for?
Does a lower cycle count mean the method is quicker?
PS: Thanks for your help.
PS2: I have no timers.asm :-(
Quote from: hutch-- on June 27, 2008, 11:33:23 PM
Noob,
The usage depends on what you are doing. In the middle of a pile of messy API code, "m2m" is easily fast enough, but the other macro that uses a register is usually faster, so if the code you are writing is closer to the bare mnemonic end, it's probably a better choice.
If your code needs as many general-purpose registers as possible, wouldn't it be time to make an mxm macro that can be used instead of mrm, where mxm makes use of xmm0?
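A minimal sketch of what such a macro could look like, assuming an SSE2-capable CPU and an assembler version that accepts the .xmm directive. The name mxm and its parameter names are hypothetical, and nobody in this thread has timed it:

```masm
.686
.xmm                            ; enable SSE/SSE2 mnemonics

; Hypothetical macro, named after the suggestion above.
mxm MACRO DstMem:REQ, SrcMem:REQ
    movd xmm0, SrcMem           ; load 32 bits into the low dword of xmm0
    movd DstMem, xmm0           ; store the low dword back to memory
ENDM

; usage, with the variables from the test program:
    mxm m0, m1
    mxm [esi+0], [esi+4]
```

This leaves eax and the other general-purpose registers alone, at the cost of overwriting xmm0.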
Quote from: jj2007 on June 28, 2008, 09:25:21 AM
On a Celeron. Looks strange...
If you are running on a P4 Celeron, or any other P4, then the instruction sequences are too short to get meaningful cycle counts. I considered this, but I had already spent more time on it than I could spare. For a quick, crude fix you could modify each test to something like this:
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 100
m2m m0, m1
m2m m2, m3
m2m m4, m5
m2m m6, m7
m2m m8, m9
m2m ma, mb
m2m mc, md
m2m me, mf
ENDM
counter_end
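The matching register-based test would get the same treatment. A sketch, assuming the same data and the same timers.asm macros as the original program; the "x100" in the label is only my addition, to make the scaling visible:

```masm
counter_begin 1000, HIGH_PRIORITY_CLASS
  REPEAT 100
    mov eax, m1
    mov m0, eax
    mov eax, m3
    mov m2, eax
    mov eax, m5
    mov m4, eax
    mov eax, m7
    mov m6, eax
    mov eax, m9
    mov m8, eax
    mov eax, mb
    mov ma, eax
    mov eax, md
    mov mc, eax
    mov eax, mf
    mov me, eax
  ENDM
counter_end
print ustr$(eax)," cycles, mrm sequential direct x100",13,10
```

The reported count then covers 100 copies of the block, so divide by 100 to compare with the earlier figures.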
Quote from: zooba on June 28, 2008, 08:40:39 AM
Guess I was imagining things then :bg . Oh well. :U
Not exactly, we've discussed that here: http://www.masm32.com/board/index.php?topic=9110.0
For the bigger machines, here's Michael's code modified to expand each block 1000x. The last two blocks were changed to perform 8 tests like the others and the range of the random values was greatly increased (to around 0-3MB or so.) Included is an executable, a RadASM project file, and timers.asm from http://www.masm32.com/board/index.php?topic=770.0
Quote from: AMD X64 4000+
33371 cycles, m2m sequential direct
18964 cycles, mrm sequential direct
34587 cycles, m2m sequential indirect
8116 cycles, mrm sequential indirect
624616 cycles, m2m random indirect
392243 cycles, mrm random indirect
[attachment deleted by admin]
Quote from: Mark Jones on July 07, 2008, 06:31:42 PM
For the bigger machines, here's Michael's code modified to expand each block 1000x.
Celeron:
99826 cycles, m2m sequential direct
19544 cycles, mrm sequential direct
98680 cycles, m2m sequential indirect
18225 cycles, mrm sequential indirect
2620671 cycles, m2m random indirect
2331846 cycles, mrm random indirect
On a Core 2 Duo, 2 GHz:
36581 cycles, m2m sequential direct
26111 cycles, mrm sequential direct
37038 cycles, m2m sequential indirect
15349 cycles, mrm sequential indirect
149714 cycles, m2m random indirect
93651 cycles, mrm random indirect
It's clear there is a speedup for push/pop on Core 2 (compared with P3/P4), but the macros compared here don't do exactly the same thing. m2m preserves all registers; mrm does not (it overwrites eax, the register most often used to return values). That's why m2m is generally used more often. Once you have been bitten by mrm, you remember later that you must use m2m (unless you are sure you will never touch or add things to your algo later... hmm... is that even possible?) :wink
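To illustrate the register-preservation point, here is a small sketch. It assumes both m2m and mrm are available through masm32rt.inc, and the variable tick is a hypothetical name used only for this example:

```masm
include \masm32\include\masm32rt.inc

.data
m0   dd 0
m1   dd 12345
tick dd 0                       ; hypothetical variable for this example

.code
start:
    invoke GetTickCount         ; return value arrives in eax
    mrm m0, m1                  ; mov eax, m1 / mov m0, eax: eax now holds m1
    mov tick, eax               ; stores m1, not the tick count
    print ustr$(tick)," stored after mrm",13,10

    invoke GetTickCount
    m2m m0, m1                  ; push m1 / pop m0: eax is untouched
    mov tick, eax               ; stores the tick count
    print ustr$(tick)," stored after m2m",13,10

    exit
end start
```

The first line printed is the value of m1, because mrm replaced the API return value before it was saved.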
Hi
This is my implementation of m2m.
I use it to freely play with the register that transfers the value or, if there is no reg available, to fall back to the push/pop version.
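m2m macro DstMem:req, SrcMem:req, AuxReg
    ifb <AuxReg>
        ; no auxiliary register given: copy through the stack
        push SrcMem
        pop DstMem
    else
        ; auxiliary register given: copy through it (the register is overwritten)
        mov AuxReg, SrcMem
        mov DstMem, AuxReg
    endif
endm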
The advantage is that if you don't provide the 3rd parameter you are compatible with existing code using push/pop.
Regards,
Biterider
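A short usage sketch of the macro above, assuming the m0/m1 variables from the earlier test program; the choice of ecx is arbitrary:

```masm
    m2m m0, m1                  ; no third argument: push m1 / pop m0
    m2m m0, m1, ecx             ; with a register: mov ecx, m1 / mov m0, ecx
```

The second form overwrites ecx, so it only makes sense where that register is free at that point.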
If your program doesn't use any x87 floating point instructions, you can use the MMX registers.
With a slight modification to MichaelW's code (replacing mov with movd and eax with mm0):
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
counter_begin 1000, HIGH_PRIORITY_CLASS
movd mm0, m1
movd m0, mm0
movd mm0, m3
movd m2, mm0
movd mm0, m5
movd m4, mm0
movd mm0, m7
movd m6, mm0
movd mm0, m9
movd m8, mm0
movd mm0, mb
movd ma, mm0
movd mm0, md
movd mc, mm0
movd mm0, mf
movd me, mm0
counter_end
counter_begin 1000, HIGH_PRIORITY_CLASS
movd mm0, [esi+4]
movd [esi+0], mm0
movd mm0, [esi+12]
movd [esi+8], mm0
movd mm0, [esi+20]
movd [esi+16], mm0
movd mm0, [esi+28]
movd [esi+24], mm0
movd mm0, [esi+36]
movd [esi+32], mm0
movd mm0, [esi+44]
movd [esi+40], mm0
movd mm0, [esi+52]
movd [esi+48], mm0
movd mm0, [esi+60]
movd [esi+56], mm0
counter_end
REPEAT 8
movd mm0, [esi+@rnd(250000)*4]
movd [esi+@rnd(250000)*4], mm0
ENDM
If you do use x87 floating point, execute the emms instruction to transition from MMX to x87 FP.
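A minimal sketch of that transition, assuming the DWORD variables m0, m1 and m2 from the test program and a .mmx directive at the top of the source:

```masm
    movd mm0, m1                ; MMX copy: the mm registers alias the x87 stack
    movd m0, mm0
    emms                        ; mark the x87 registers empty again
    fild m0                     ; x87 code is safe from here on
    fistp m2
```

Without the emms, the fild could find the x87 register stack marked as full and produce an invalid result.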