m2m vs mrm

Started by n00b!, June 27, 2008, 04:08:02 PM

n00b!

Hello,
I want to know which macro is quicker.

Using EAX as a buffer or pushing and popping to/from the stack.

Thanks in advance.

bozo

I believe a mov is always faster, but some will dispute that.
The only way to know for sure is to time your code.

zooba

Quote from: Kernel_Gaddafi on June 27, 2008, 04:13:46 PM
i believe a mov is always faster

That seems intuitive, I'll agree. However, I seem to recall some testing that happened a while ago (here somewhere, try searching) that found m2m was actually faster.

Caught quite a few people by surprise  :bg

Cheers,

Zooba :U

hutch--

Noob,

The usage depends on what you are doing. In the middle of a pile of messy API code, "m2m" is easily fast enough, but the other macro that uses a register is usually faster, so if the code you are writing is closer to the bare mnemonic end, it's probably a better choice.
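
For reference, here is a rough sketch of what the two macros being compared expand to, based on how they are described in this thread (m2m goes through the stack, mrm goes through EAX). The parameter names are assumptions; check macros.asm in your own MASM32 install for the real definitions.

```masm
; Sketch only: parameter names are assumptions, not copied from macros.asm.
m2m MACRO dst:REQ, src:REQ
    push src            ; memory -> stack
    pop  dst            ; stack  -> memory, no general register is touched
ENDM

mrm MACRO dst:REQ, src:REQ
    mov eax, src        ; memory -> EAX (the previous EAX value is lost)
    mov dst, eax        ; EAX    -> memory
ENDM
```

Both copy one DWORD from memory to memory; the difference is whether the stack or EAX carries the value.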
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

Running on my P3, I cannot find any circumstances where m2m is faster.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    ; -------------------------------------------------------
    ; This is an assembly-time random number generator based
    ; on code by George Marsaglia:
    ;   #define znew  ((z=36969*(z&65535)+(z>>16))<<16)
    ;   #define wnew  ((w=18000*(w&65535)+(w>>16))&65535)
    ;   #define MWC   (znew+wnew)
    ; -------------------------------------------------------

    @znew_seed@ = 362436069
    @wnew_seed@ = 521288629

    @rnd MACRO base:REQ
      LOCAL znew, wnew

      @znew_seed@ = 36969 * (@znew_seed@ AND 65535) + (@znew_seed@ SHR 16)
      znew = @znew_seed@ SHL 16

      @wnew_seed@ = 18000 * (@wnew_seed@ AND 65535) + (@wnew_seed@ SHR 16)
      wnew = @wnew_seed@ AND 65535

      EXITM <(znew + wnew) MOD base>
    ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      FOR mem,<m0,m1,m2,m3,m4,m5,m6,m7,m8,m9,ma,mb,mc,md,me,mf>
        mem dd 0
      ENDM
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov esi, alloc(1000000*4)

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      m2m m0, m1
      m2m m2, m3
      m2m m4, m5
      m2m m6, m7
      m2m m8, m9
      m2m ma, mb
      m2m mc, md
      m2m me, mf
    counter_end
    print ustr$(eax)," cycles, m2m sequential direct",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov eax, m1
      mov m0, eax
      mov eax, m3
      mov m2, eax
      mov eax, m5
      mov m4, eax
      mov eax, m7
      mov m6, eax
      mov eax, m9
      mov m8, eax
      mov eax, mb
      mov ma, eax
      mov eax, md
      mov mc, eax
      mov eax, mf
      mov me, eax
    counter_end
    print ustr$(eax)," cycles, mrm sequential direct",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      m2m [esi+0], [esi+4]
      m2m [esi+8], [esi+12]
      m2m [esi+16], [esi+20]
      m2m [esi+24], [esi+28]
      m2m [esi+32], [esi+36]
      m2m [esi+40], [esi+44]
      m2m [esi+48], [esi+52]
      m2m [esi+56], [esi+60]
    counter_end
    print ustr$(eax)," cycles, m2m sequential indirect",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov eax, [esi+4]
      mov [esi+0], eax
      mov eax, [esi+12]
      mov [esi+8], eax
      mov eax, [esi+20]
      mov [esi+16], eax
      mov eax, [esi+28]
      mov [esi+24], eax
      mov eax, [esi+36]
      mov [esi+32], eax
      mov eax, [esi+44]
      mov [esi+40], eax
      mov eax, [esi+52]
      mov [esi+48], eax
      mov eax, [esi+60]
      mov [esi+56], eax
    counter_end
    print ustr$(eax)," cycles, mrm sequential indirect",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 8
        m2m [esi+@rnd(250000)*4],[esi+@rnd(250000)*4]
      ENDM
    counter_end
    print ustr$(eax)," cycles, m2m random indirect",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 8
        mov eax, [esi+@rnd(250000)*4]
        mov [esi+@rnd(250000)*4], eax
      ENDM
    counter_end
    print ustr$(eax)," cycles, mrm random indirect",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


28 cycles, m2m sequential direct
6 cycles, mrm sequential direct

28 cycles, m2m sequential indirect
10 cycles, mrm sequential indirect

55 cycles, m2m random indirect
36 cycles, mrm random indirect

eschew obfuscation

zooba

Quote from: MichaelW on June 28, 2008, 08:13:06 AM
Running on my P3, I cannot find any circumstances where m2m is faster.

Guess I was imagining things then :bg . Oh well. :U

jj2007

On a Celeron. Looks strange...

94 cycles, m2m sequential direct
1 cycles, mrm sequential direct

84 cycles, m2m sequential indirect
2 cycles, mrm sequential indirect

260 cycles, m2m random indirect
172 cycles, mrm random indirect

n00b!

What are the cycles and esi for?
If the cycle counts are lower, is the method quicker?

PS: Thanks for your help.
PS2: I have no timers.asm :-(

daydreamer

Quote from: hutch-- on June 27, 2008, 11:33:23 PM
Noob,

The usage depends on what you are doing. In the middle of a pile of messy API code, "m2m" is easily fast enough, but the other macro that uses a register is usually faster, so if the code you are writing is closer to the bare mnemonic end, it's probably a better choice.

If your code needs as many general-purpose registers as possible, shouldn't it be time to make an mxm macro that can be used instead of mrm,
where mxm makes use of xmm0?
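
A minimal sketch of what such an mxm macro might look like. The name mxm comes from the post above; everything else is an assumption: an assembler and CPU that support SSE2, operands that are DWORD-sized memory locations, and xmm0 being free to overwrite. Nobody in this thread has timed it, so whether it beats mrm or m2m is unknown.

```masm
    .686
    .xmm                          ; enable the xmm registers and SSE mnemonics

; Hypothetical mxm: copies one DWORD through xmm0 instead of a general register.
; Assumes SSE2 is available and that the caller does not need xmm0 preserved.
mxm MACRO DstMem:REQ, SrcMem:REQ
    movd xmm0, dword ptr SrcMem   ; load 32 bits into the low part of xmm0
    movd dword ptr DstMem, xmm0   ; store the low 32 bits of xmm0
ENDM
```

It would be used like the other macros, for example mxm m0, m1 or mxm [esi+0], [esi+4].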

MichaelW

Quote from: jj2007 on June 28, 2008, 09:25:21 AM
On a Celeron. Looks strange...

If you are running on a P4 Celeron, or any other P4, then the instruction sequences are too short to get meaningful cycle counts. I considered this, but I had already spent more time on it than I had. For a quick, crude fix you could modify each test to something like this:

counter_begin 1000, HIGH_PRIORITY_CLASS
  REPEAT 100
    m2m m0, m1
    m2m m2, m3
    m2m m4, m5
    m2m m6, m7
    m2m m8, m9
    m2m ma, mb
    m2m mc, md
    m2m me, mf
  ENDM 
counter_end

eschew obfuscation

NightWare

Quote from: zooba on June 28, 2008, 08:40:39 AM
Guess I was imagining things then :bg . Oh well. :U

Not exactly, we've spoken of that here: http://www.masm32.com/board/index.php?topic=9110.0

Mark Jones

For the bigger machines, here's Michael's code modified to expand each block 1000x. The last two blocks were changed to perform 8 tests like the others and the range of the random values was greatly increased (to around 0-3MB or so.) Included is an executable, a RadASM project file, and timers.asm from http://www.masm32.com/board/index.php?topic=770.0

Quote from: AMD X64 4000+
33371 cycles, m2m sequential direct
18964 cycles, mrm sequential direct

34587 cycles, m2m sequential indirect
8116 cycles, mrm sequential indirect

624616 cycles, m2m random indirect
392243 cycles, mrm random indirect

[attachment deleted by admin]
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on July 07, 2008, 06:31:42 PM
For the bigger machines, here's Michael's code modified to expand each block 1000x.

Celeron:

99826 cycles, m2m sequential direct
19544 cycles, mrm sequential direct

98680 cycles, m2m sequential indirect
18225 cycles, mrm sequential indirect

2620671 cycles, m2m random indirect
2331846 cycles, mrm random indirect

NightWare

On a Core 2 Duo at 2 GHz:
36581 cycles, m2m sequential direct
26111 cycles, mrm sequential direct

37038 cycles, m2m sequential indirect
15349 cycles, mrm sequential indirect

149714 cycles, m2m random indirect
93651 cycles, mrm random indirect


It's clear there is a speed-up for push/pop on Core 2 (compared with P3/P4), but the macros compared here don't do exactly the same thing: m2m preserves the registers, while mrm does not (in particular eax, the register most often used to return values). That's why m2m is generally used more often. Once you have lost time to mrm, you remember later that you must use m2m (unless you are sure you will never touch or add things to your algo later... hmm... is that even possible?)  :wink
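
To make the register-preservation point concrete, here is a small illustration. The variables m0, m1 and tick are hypothetical DWORDs, and GetTickCount merely stands in for any call that returns its result in eax.

```masm
; Illustration only: m0, m1 and tick are hypothetical DWORD variables.
    invoke GetTickCount     ; return value arrives in eax
    mrm m0, m1              ; mov eax, m1 / mov m0, eax -> eax is overwritten
    mov tick, eax           ; BUG: stores the value of m1, not the tick count

    invoke GetTickCount     ; return value arrives in eax
    m2m m0, m1              ; push m1 / pop m0 -> eax is untouched
    mov tick, eax           ; OK: stores the tick count
```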

Biterider

Hi
This is my implementation of m2m.
I use it to freely play with the register that transfers the value or, if there is no reg available, to fall back to the push/pop version.

m2m macro DstMem:req, SrcMem:req, AuxReg
    ifb <AuxReg>
      push SrcMem
      pop DstMem
    else
      mov AuxReg, SrcMem
      mov DstMem, AuxReg
    endif
endm


The advantage is that if you don't provide the 3rd parameter you are compatible with existing code using push/pop.
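
A short usage sketch of the macro above, assuming m0 and m1 are DWORD variables and esi points to a DWORD buffer as in MichaelW's test program; the choice of ecx and edx is arbitrary:

```masm
; Assumes m0/m1 are DWORD variables and esi points to a DWORD buffer.
    m2m m0, m1                  ; no AuxReg: push m1 / pop m0
    m2m m0, m1, ecx             ; AuxReg given: mov ecx, m1 / mov m0, ecx
    m2m [esi+0], [esi+4], edx   ; same idea with indirect operands, via edx
```

The third form overwrites edx, so it only makes sense where that register is known to be free.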

Regards,

Biterider