stb & lod Macros

Neil · November 19, 2008, 03:13:12 PM

Having used these macros, am I correct in assuming that their word & dword equivalents would be faster than their string counterparts, or would the inc or add instructions make them slower?

Tedd · November 19, 2008, 03:50:19 PM

The string instructions can be faster under some circumstances, as long as the right amount of data is being handled - there's a cut-off where it becomes faster to do it the 'long' way (or is it the other way around :lol)
There is another thread about this... somewhere.

Neil · November 19, 2008, 03:57:41 PM

Tedd,
Are you saying that

mov [edi],eax
add edi,4

is faster than stosd?

Neil · November 19, 2008, 04:04:39 PM

Tedd,
I've searched for the other thread you mentioned but can't find it; maybe I'm putting in the wrong search parameters.

hutch-- · November 19, 2008, 04:18:17 PM

Neil,

There are a couple of special cases among the string instructions: REP movsd and REP stosd. They only perform well with the REP prefix; used on their own they are very slow and should be avoided. The REP string instructions do outperform the normal integer instructions in some contexts. If source and destination are not both in cache at the same time they are faster, as they appear to handle non-temporal writes in much the same way as the specialised SSE instructions, where the normal integer instructions don't have that option.

Most of us grew up with the string instructions, but from the early PIIs onwards they have not been competitive in most instances. Loading a register and incrementing the index is almost always faster, sometimes by a large amount.
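
As a rough sketch of the two approaches being compared here, the program below fills a buffer once with REP stosd and once with a plain mov/add loop. The buffer name fillbuf and its size are made up for the example, and nothing is timed, so it says nothing about which one wins on a given CPU.

```asm
    include \masm32\include\masm32rt.inc

    .data?
      fillbuf dd 1024 dup(?)        ; hypothetical buffer, name and size are assumptions

    .code
start:
    ; fill using the REP string instruction
    cld                             ; direction flag clear, so edi counts upward
    mov edi, OFFSET fillbuf         ; destination
    mov ecx, LENGTHOF fillbuf       ; number of dwords to store
    xor eax, eax                    ; value to store
    rep stosd                       ; store eax ecx times, edi advances by 4 each time

    ; the same fill done the 'long' way
    mov edi, OFFSET fillbuf
    mov ecx, LENGTHOF fillbuf
    xor eax, eax
  @@:
    mov [edi], eax                  ; store one dword
    add edi, 4                      ; advance the destination pointer
    dec ecx
    jnz @B

    exit
end start
```

Wrapping each half in the counter_begin / counter_end macros from timers.asm would be one way to compare them on a particular machine.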

MichaelW · November 19, 2008, 04:25:56 PM

I couldn't think of a good real-world test for the macros, but on my P3 there appears to be no contest.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 8 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lodsb
      ENDM
    counter_end
    print ustr$(eax), " cycles, lodsb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lob
      ENDM
    counter_end
    print ustr$(eax), " cycles, lob",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stosb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stosb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stb",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

17 cycles, lodsb
5 cycles, lob
16 cycles, stosb
8 cycles, stb

And this is more or less of a worst-case comparison:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff1 dd 1000 dup(0)
      buff2 dd 1000 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]      ; esi and edi are not advanced in this loop,
      mov [edi], eax      ; so every pass copies the first dword again
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]      ; esi and edi are not advanced in this loop,
      mov [edi], eax      ; so every pass copies the first dword again
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

926 cycles
3005 cycles
924 cycles
3005 cycles
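
The mov loop in the listing above never advances esi or edi, so each pass copies the same dword. A variant that walks both buffers, which is closer to what rep movsd actually does, might look like the sketch below. It reuses the buffer names and timing macros from the listing; no cycle counts are given because this version has not been timed here.

```asm
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .data
      buff1 dd 1000 dup(0)
      buff2 dd 1000 dup(0)

    .code
start:
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]        ; load one dword from the source
      mov [edi], eax        ; store it to the destination
      add esi, 4            ; advance the source pointer
      add edi, 4            ; advance the destination pointer
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles, mov loop with pointer updates",13,10

    inkey "Press any key to exit..."
    exit
end start
```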

Neil · November 19, 2008, 04:45:59 PM

Thanks Hutch & Michael,
Michael, your demo code gives a great illustration of the speed difference: the lob macro is more than 3 times as fast as lodsb. It also demonstrates how much rep speeds up the string instructions. Now I have one more question, regarding writing macros to replace stosw, stosd etc.: are 2 incs or 4 incs faster or slower than adding the appropriate offset? There must be a point at which n incs become slower than add n.

MichaelW · November 19, 2008, 05:03:40 PM

I think add would probably be faster than two or more incs.
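
For illustration, word and dword macros built on add could be sketched as below. The macro names are invented for this example and are not the MASM32 definitions of stb and lob, which are not shown in this thread.

```asm
; Hypothetical macro names, chosen only for this sketch.

st_w MACRO                  ; stand-in for stosw with the direction flag clear
    mov [edi], ax
    add edi, 2
ENDM

st_d MACRO                  ; stand-in for stosd with the direction flag clear
    mov [edi], eax
    add edi, 4
ENDM

lo_w MACRO                  ; stand-in for lodsw with the direction flag clear
    mov ax, [esi]
    add esi, 2
ENDM

lo_d MACRO                  ; stand-in for lodsd with the direction flag clear
    mov eax, [esi]
    add esi, 4
ENDM
```

Two differences from the real string instructions are worth keeping in mind. add changes the arithmetic flags, while stos and lods leave them alone, and these macros always move forward regardless of the direction flag. Code that relies on std would need a variant that uses sub instead.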

Neil · November 19, 2008, 05:09:58 PM

Well, add it is then :U

Neil · November 19, 2008, 05:13:52 PM

Mind you, I have a few instances where I use std; this is starting to get complicated :toothy

KeepingRealBusy · November 19, 2008, 05:21:57 PM

Neil,

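I think it all depends on what you are doing:

mov ecx, count          ; number of dwords to copy
dec ecx                 ; index of the last dword
mov esi,OFFSET str      ; source base address
mov edi,OFFSET buf      ; destination base address
@@:
mov eax,[esi+ecx*4]     ; load dword number ecx
mov [edi+ecx*4],eax     ; store it at the same index
dec ecx                 ; step down to the previous dword
jns @B                  ; loop until the index goes negative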

Depending on what you're doing, you only need a single increment or decrement per pass, since the one index serves both pointers.

Dave
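
The increment form of the same idea can be sketched by pointing both registers at the end of their buffers and counting a negative index up to zero. The buffer names and sizes below are assumptions made for the example.

```asm
    include \masm32\include\masm32rt.inc

    .data
      srcbuf dd 1000 dup(0)         ; hypothetical source buffer
      dstbuf dd 1000 dup(0)         ; hypothetical destination buffer

    .code
start:
    mov esi, OFFSET srcbuf
    add esi, SIZEOF srcbuf          ; esi = one past the end of the source
    mov edi, OFFSET dstbuf
    add edi, SIZEOF dstbuf          ; edi = one past the end of the destination
    mov ecx, LENGTHOF srcbuf
    neg ecx                         ; index runs from -count up to -1
  @@:
    mov eax, [esi+ecx*4]            ; first pass reads the first dword of srcbuf
    mov [edi+ecx*4], eax
    inc ecx                         ; the only pointer-style update in the loop
    jnz @B                          ; stop when the index reaches zero

    exit
end start
```

This copies from the lowest address upward, whereas the decrement version above copies from the highest address down.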

Neil · November 19, 2008, 07:31:22 PM

Dave,
That's an interesting code snippet, but (correct me if I'm wrong) according to Hutch & Michael, rep movsd would be much quicker.
